A complete personal reference covering undergraduate statistics courses — definitions, theory, formulas, visualizations, applications, and critical notes from lectures and textbooks at Begum Rokeya University, Rangpur.
STAT1101 · Principles of Statistics I
STAT1201 · Principles of Statistics II
STAT1102 · Probability Theory
STAT2102 · Probability Distributions
STAT2101 · Regression Analysis & Diagnostics
STAT3203 · Econometrics
STAT2201 · Sampling Distribution
STAT2203 · ANOVA & Design of Experiment
STAT3201 · Hypothesis Testing
STAT4101 · Multivariate Distribution
STAT4201 · Multivariate Analysis II
STAT4102 · Sampling Techniques
STAT4106 · Categorical Data Analysis
STAT4104 · Research Methodology
σ
STAT1101 · STAT1201 · B.Sc. Statistics · BRUR
Principles of Statistics I & II
Statistics & Origin · Central Tendency · Dispersion · Index Numbers · Time Series · Correlation · Regression · Attributes · Shape · Bivariate
Statistics is the science of collecting, organising, analysing, interpreting, and presenting data to make informed decisions and draw conclusions under uncertainty.
💡
Two Branches
Descriptive vs Inferential
Descriptive: Summarises & describes data (means, charts, tables)
Inferential: Draws conclusions about a population from a sample using probability
✅
Where to Use
Applications
Medical research & clinical trials
Economics & finance forecasting
Government census & planning
Machine learning & AI systems
Agriculture & environmental studies
⚠️
Where NOT to Use
Cautions
Predicting individuals with certainty
When data quality is very poor
Proving causation from correlation alone
Non-homogeneous data without caution
Key Quote: "Statistics is the grammar of science." — Karl Pearson. It converts raw numbers into knowledge.
· · ·
02
Background
History & Origins
🏛️
Ancient Roots
Early Beginnings
Babylonians collected census data ~3000 BCE
Egyptians used data for pyramid construction planning
Romans conducted systematic population censuses
India: Arthashastra of Kautilya mentions data collection
📜
Modern Development
17th–20th Century
Graunt (1662): Bills of Mortality — first statistical study of births and deaths
Gauss & Laplace: Normal distribution, method of least squares
Pearson: Correlation coefficient r, chi-square test
Fisher: ANOVA, experimental design, p-values, maximum likelihood
Etymology: From the Latin statisticum collegium ("council of state") and Italian statista ("statesman") — originally data useful to the state. The word entered English as "statistics" in the 18th century.
Nominal: Named categories, no order — gender, blood group, colour
Ordinal: Ordered categories, unequal gaps — grades, satisfaction ratings
Interval: Equal intervals, no true zero — temperature (°C, °F), IQ
Ratio: True zero exists — weight, height, income, time
Levels of Measurement — Hierarchy
· · ·
04
Practical View
Uses, Importance & Limitations
✅
Major Uses
Why We Use Statistics
Simplifying complex masses of data into meaningful summaries
Comparing groups, phenomena, and time periods
Establishing relationships between variables
Forecasting future trends based on past data
Testing hypotheses scientifically with rigour
⭐
Importance
Why It Matters
Basis for evidence-based policy and decision-making
Essential in every science, social study, and industry
Enables uncertainty quantification and risk assessment
Guides business, economic, and medical decisions
⚠️
Limitations
What Statistics Cannot Do
Deals only with quantifiable, aggregated facts
Results can be misused or deliberately manipulated
Statistical laws apply to groups, not individuals
Requires homogeneous, high-quality data
Cannot prove causation on its own
· · ·
05
Data Collection
Sources of Statistical Data
🔵
Primary Sources
Original Data (First-hand)
Direct personal observation
Questionnaires & structured surveys
Interviews (direct/indirect methods)
Experimental data from controlled studies
Registration systems (births, deaths, marriages)
📂
Secondary Sources
Existing/Published Data
Government publications & national census
Research journals, reports & theses
International agencies (UN, WHO, World Bank, IMF)
Newspapers, almanacs, online databases
💡
Which to Choose?
Primary vs Secondary
Use primary when precision & specificity are critical and budget allows. Use secondary when time/cost are constraints. Always check secondary data for reliability, suitability, and adequacy before use.
· · ·
06
Data Pipeline
Processing & Preprocessing
⚙️
Steps in the Process
Data Processing Pipeline
Editing: Check for errors, omissions, inconsistencies
Coding: Assign numerical values to categorical responses
Classification: Group data into meaningful classes
Tabulation: Arrange data in tables (frequency distributions)
Presentation: Charts, graphs, diagrams for communication
📊
Frequency Distributions
Organising Raw Data
Class interval, class limits, class mark (midpoint)
Class frequency & relative frequency (proportion)
Cumulative frequency (less than / greater than)
Histogram, Frequency Polygon, Ogive (cumulative curve)
Golden Rule of Preprocessing: "Garbage in, garbage out." Cleaning the data is the most critical step. Missing values, outliers, and coding errors must be detected and handled before any statistical analysis.
Histogram — Frequency Distribution Concept
· · ·
07
Descriptive Statistics
Measures of Central Tendency
📖
What it is
The "Centre" of Data
A single value representing the typical or central value in a dataset. The three primary measures are Mean, Median, and Mode, each optimal under different data conditions.
🔢
Key Formulas
The Big Five
AM: Σx / n — arithmetic average
Median: Middle value in sorted data
Mode: Most frequently occurring value
GM: (x₁·x₂·…·xₙ)^(1/n) — for ratios, growth
HM: n / Σ(1/xᵢ) — for rates & speeds
✅
When to Use Each
Right Tool, Right Job
Mean: Symmetric data, no extreme outliers, interval/ratio scale
Median: Skewed distributions, income, housing prices, ordinal data
Mode: Categorical data, most popular item, bimodal distributions
GM: Ratios, growth rates, compound interest, index numbers
HM: Averaging rates, speeds, prices per unit
⚠️
Cautions
Common Mistakes
Mean is highly sensitive to outliers — check for skewness first
Mode may not exist or may not be unique (bimodal)
Never compute the mean for nominal or ordinal data
AM ≥ GM ≥ HM always (equality only when all values equal)
Inequality: HM ≤ GM ≤ AM (always; equality iff all xᵢ equal)
Median (odd n): M = x₍(n+1)/2₎ after sorting
Median (even n): M = [x₍n/2₎ + x₍n/2+1₎] / 2
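A minimal Python sketch of these formulas (the data values are illustrative), which also confirms the AM ≥ GM ≥ HM inequality:

```python
import math

data = [2.0, 4.0, 8.0]  # illustrative values
n = len(data)

am = sum(data) / n                    # arithmetic mean: Σx / n
gm = math.prod(data) ** (1 / n)       # geometric mean: (x₁·x₂·…·xₙ)^(1/n)
hm = n / sum(1 / x for x in data)     # harmonic mean: n / Σ(1/xᵢ)

xs = sorted(data)
# median: middle value (odd n) or mean of the two middle values (even n)
median = xs[n // 2] if n % 2 else (xs[n // 2 - 1] + xs[n // 2]) / 2
```

For [2, 4, 8] the geometric mean is exactly 4, and the inequality holds strictly because the values differ.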
Central Tendency — Symmetric vs Skewed Distributions
· · ·
08
Spread of Data
Measures of Dispersion
📏
What it is
Quantifying Variability
Dispersion measures the spread or variability in a dataset. Two datasets can have the same mean but vastly different spreads — dispersion captures this critical difference.
⚙️
All Measures
Absolute & Relative
Range: Max − Min (simplest; very sensitive to outliers)
Quartile Deviation (QD): (Q3−Q1)/2
Mean Deviation (MD): Σ|x−x̄| / n
Variance (σ²): Σ(x−x̄)² / n or s² = Σ(x−x̄)² / (n−1)
Std Deviation (σ): √Variance
Coeff. of Variation (CV): (σ/x̄)×100 — unit-free comparator
💡
Main Idea
Absolute vs Relative
Absolute: Range, SD, Variance — in original units; cannot compare datasets with different units
Relative: CV — unit-free percentage; use to compare variability across different datasets
Population Variance: σ² = (1/N) · Σᵢ(xᵢ − μ)²
Sample Variance: s² = (1/(n−1)) · Σᵢ(xᵢ − x̄)²
Std Deviation: σ = √[ Σᵢ(xᵢ − μ)² / N ]
Computing formula: σ² = (1/n)Σxᵢ² − x̄²
Coeff. of Variation: CV = (σ / x̄) × 100%
Quartile Deviation: QD = (Q3 − Q1) / 2
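These dispersion formulas can be sketched in Python (sample values are illustrative):

```python
import math

def dispersion(data, sample=True):
    """Return (variance, sd, cv%); n−1 divisor for sample, n for population."""
    n = len(data)
    mean = sum(data) / n
    ss = sum((x - mean) ** 2 for x in data)      # Σ(x − mean)²
    var = ss / (n - 1) if sample else ss / n
    sd = math.sqrt(var)
    cv = sd / mean * 100                         # unit-free comparator
    return var, sd, cv

var, sd, cv = dispersion([2, 4, 4, 4, 5, 5, 7, 9], sample=False)
# mean = 5, so σ² = 4, σ = 2, CV = 40%
```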
· · ·
09
Economic Measurement
Index Numbers
📈
What it is
Relative Change Measure
An index number measures the relative change in a variable (or group) compared to a base period. Expressed as a percentage relative to the base (base period = 100). Used to track changes over time.
⚙️
Types
Key Methods
Laspeyres Index: Uses base-period quantities as weights
Paasche Index: Uses current-period quantities as weights
Fisher's Ideal Index: Geometric mean of Laspeyres & Paasche — satisfies time reversal & factor reversal tests
Value index: Ratio of current to base-period value
✅
Real-World Use
Applications
Consumer Price Index (CPI) — measuring inflation
Stock market indices (S&P 500, BSE Sensex)
Human Development Index (HDI)
Adjusting wages for purchasing power
Laspeyres P-Index: L = (Σ p₁q₀) / (Σ p₀q₀) × 100
Paasche P-Index: P = (Σ p₁q₁) / (Σ p₀q₁) × 100
Fisher Ideal Index: F = √(L × P)
Simple Price Rel.: P₀₁ = (p₁ / p₀) × 100
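A sketch of the three index formulas, using a hypothetical two-commodity basket with a uniform 20% price rise (so all three indices agree at 120):

```python
def laspeyres(p0, q0, p1):
    """Laspeyres price index: base-period quantities q0 as weights."""
    return 100 * sum(p * q for p, q in zip(p1, q0)) / sum(p * q for p, q in zip(p0, q0))

def paasche(p0, p1, q1):
    """Paasche price index: current-period quantities q1 as weights."""
    return 100 * sum(p * q for p, q in zip(p1, q1)) / sum(p * q for p, q in zip(p0, q1))

def fisher(l_index, p_index):
    """Fisher's ideal index: geometric mean of Laspeyres and Paasche."""
    return (l_index * p_index) ** 0.5

p0, p1 = [10, 20], [12, 24]   # every price up exactly 20%
q0, q1 = [5, 3], [4, 3]
L = laspeyres(p0, q0, p1)     # 120.0
P = paasche(p0, p1, q1)       # 120.0
F = fisher(L, P)              # 120.0
```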
· · ·
10
Temporal Data
Time Series Basics
🕐
What it is
Data Over Time
A time series is a sequence of data points collected at successive, equally-spaced time intervals. Goal: identify patterns, decompose components, and forecast future values.
💡
4 Components
Decomposition (TSCI)
Trend (T): Long-term direction (upward/downward/stationary)
Seasonal (S): Regular periodic fluctuations within a year
Cyclical (C): Long-run waves lasting 2–10 years (business cycles)
Irregular (I): Random, unpredictable residual fluctuations
Smoothing & trend-fitting methods:
Moving averages: Simple smoothing of irregular fluctuations
Least squares: Fit linear/polynomial trend equation
Exponential smoothing: Weighted average of past observations
Additive Model: Y = T + S + C + I
Multiplicative Model: Y = T × S × C × I
Trend Line (OLS): Ŷ = a + bt (t = coded time)
3-Period Moving Avg: MA₃ = (Yₜ₋₁ + Yₜ + Yₜ₊₁) / 3
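A sketch of the moving average and OLS trend line in Python (the series values are illustrative):

```python
def moving_avg3(y):
    """Centred 3-period moving average: MA₃ = (Yₜ₋₁ + Yₜ + Yₜ₊₁) / 3."""
    return [(y[t - 1] + y[t] + y[t + 1]) / 3 for t in range(1, len(y) - 1)]

def ols_trend(y):
    """Fit trend line Ŷ = a + bt by least squares, coded time t = 0, 1, …, n−1."""
    n = len(y)
    tbar = (n - 1) / 2
    ybar = sum(y) / n
    b = (sum((t - tbar) * (yt - ybar) for t, yt in enumerate(y))
         / sum((t - tbar) ** 2 for t in range(n)))
    a = ybar - b * tbar
    return a, b

smoothed = moving_avg3([1, 2, 9, 4, 5])   # [4.0, 5.0, 6.0] — spike at 9 smoothed out
a, b = ols_trend([2, 4, 6, 8, 10])        # perfect linear trend: a = 2, b = 2
```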
· · ·
11
Bivariate Analysis
Correlation
🔗
What it is
Measuring Association
Correlation measures the strength and direction of the linear relationship between two variables. The Pearson correlation coefficient r ranges from −1 to +1.
💡
Types
Types of Correlation
Positive (r > 0): Both variables increase together
Negative (r < 0): One increases, other decreases
Zero (r = 0): No linear relationship
Perfect (r = ±1): All points on a straight line
⚙️
Methods
How to Compute
Pearson's r: For interval/ratio data with linear relation
Spearman's ρ: For ordinal/ranked data or monotonic non-linear relations
Scatter diagram: Always plot first — visualise the relationship
⚠️
Critical Warning
Correlation ≠ Causation
High correlation does not prove one variable causes the other. A lurking (confounding) variable may drive both. Always investigate mechanism and theory before claiming causation.
r² (Coeff. of Det.): r² = Explained variation / Total variation
Correlation Strength — Scatter Plot Patterns
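Pearson's r can be sketched directly from its definition (example data chosen to show the perfect-correlation cases):

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient r ∈ [−1, +1]."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))   # Σ(x−x̄)(y−ȳ)
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

r_pos = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])   # all points on an upward line: +1
r_neg = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])   # all points on a downward line: −1
```

Squaring r gives the coefficient of determination r², the share of variation explained.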
· · ·
12
Prediction
Regression Analysis
📉
What it is
Line of Best Fit
Regression establishes a mathematical relationship to predict the value of a dependent variable (Y) from an independent variable (X) using the principle of Ordinary Least Squares (OLS).
⚙️
OLS Principle
Minimising Residuals
OLS minimises the sum of squared residuals (SSE) — the vertical distances between observed Y and predicted Ŷ. This gives the unique best-fit line through the data. Two regression lines exist: Y on X, and X on Y; they intersect at (x̄, ȳ).
Attributes are qualitative characteristics (literacy, colour, gender, disease) that are categorised rather than measured. Analysis counts classes and tests association between categories.
⚙️
Methods
Statistical Tools
Contingency tables: Cross-tabulation of two attributes
χ² test: Tests independence between attributes
Yule's Q: Coefficient of association (−1 to +1)
Consistency check: Ensure all class frequencies ≥ 0
💡
Association
When Are Attributes Related?
Two attributes are associated if their joint frequency differs from expectation under independence. Positive association: both present together more than chance. Negative: inversely linked.
Chi-square Test: χ² = Σ (O − E)² / E (O = observed, E = expected)
Expected Frequency: E = (Row total × Column total) / Grand total
Yule's Q: Q = (AD − BC) / (AD + BC)
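A sketch of χ² and Yule's Q for a 2×2 contingency table (the cell counts are illustrative, chosen proportional so the attributes are exactly independent):

```python
def chi_square_2x2(a, b, c, d):
    """χ² = Σ (O − E)² / E for the 2×2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row = [a + b, c + d]
    col = [a + c, b + d]
    obs = [[a, b], [c, d]]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            e = row[i] * col[j] / n          # E = row total × column total / grand total
            chi2 += (obs[i][j] - e) ** 2 / e
    return chi2

def yules_q(a, b, c, d):
    """Yule's coefficient of association: Q = (AD − BC) / (AD + BC)."""
    return (a * d - b * c) / (a * d + b * c)

chi2_indep = chi_square_2x2(10, 20, 20, 40)   # proportional cells ⇒ χ² = 0
q_indep = yules_q(10, 20, 20, 40)             # AD = BC ⇒ Q = 0, no association
```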
· · ·
14
Distribution Shape
Shape Characteristics — Skewness & Kurtosis
〰️
Skewness
Asymmetry of Distribution
Symmetric (Sk=0): Mean = Median = Mode
Positive skew (+): Mean > Median > Mode — right tail longer
Negative skew (−): Mean < Median < Mode — left tail longer
📐
Kurtosis
Peakedness (Tailedness)
Mesokurtic (β₂=3): Normal distribution — standard shape
Leptokurtic (β₂>3): More peaked, heavier tails than normal
Platykurtic (β₂<3): Flatter peak, lighter tails than normal
Kurtosis β₂: β₂ = μ₄ / σ⁴ (4th central moment / (σ²)²)
Excess Kurtosis: γ₂ = β₂ − 3 (= 0 for normal)
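The moment measures of shape can be sketched as (data values are illustrative):

```python
def shape(data):
    """Moment-based skewness √β₁ = μ₃/σ³ and kurtosis β₂ = μ₄/σ⁴."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n   # σ² (2nd central moment)
    m3 = sum((x - mean) ** 3 for x in data) / n   # 3rd central moment
    m4 = sum((x - mean) ** 4 for x in data) / n   # 4th central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2

skew, kurt = shape([1, 2, 3, 4, 5])
# symmetric data ⇒ skew = 0; this flat-topped set is platykurtic (β₂ = 1.7 < 3)
```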
· · ·
15
Two-Variable Analysis
Bivariate Distribution
📊
What it is
Joint Distribution of (X, Y)
A bivariate distribution shows the joint frequency distribution of two variables simultaneously — revealing their individual behaviour AND their joint patterns and dependence structure.
⚙️
Key Concepts
Components
Marginal distributions: Distribution of each variable alone (sum over other)
Conditional distributions: One variable given the other fixed
Bivariate normal: 2D bell curve — described by μₓ, μᵧ, σₓ, σᵧ, and ρ
💡
Why It Matters
Bridge to Multivariate
Bivariate analysis is the essential bridge between single-variable and multivariate statistics. Correlation and regression both rest on understanding the bivariate joint distribution of (X, Y).
P(A|B) = the probability of A given that B has already occurred. We restrict the sample space to B and measure A within it. This is the "updated" probability with new information.
💡
Independence
When Knowledge Changes Nothing
A and B independent iff P(A|B) = P(A)
Equivalently: P(A ∩ B) = P(A) · P(B)
Independence ≠ mutual exclusivity
Mutually exclusive events with P>0 are never independent
🔄
Bayes' Theorem
Reversing Conditional Probability
Given P(E|H) we find P(H|E). We update prior belief P(H) with evidence E to get posterior P(H|E). Used in: medical diagnosis, spam filtering, ML classifiers.
✅
Law of Total Probability
Averaging Over Causes
If {H₁,…,Hₙ} is a partition of S, then: P(E) = Σᵢ P(E|Hᵢ)·P(Hᵢ). The denominator of Bayes' theorem — the total probability of the evidence.
Bayes Intuition: A medical test is positive. Bayes tells you the true probability of actually having the disease, accounting for the test's false positive rate AND the disease prevalence (prior). Without Bayes, most people vastly overestimate their risk.
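The medical-test intuition above can be made concrete. A sketch with hypothetical numbers (99% sensitivity, 95% specificity, 1% prevalence — all illustrative assumptions):

```python
def posterior(prior, sensitivity, specificity):
    """P(disease | positive) via Bayes' theorem + law of total probability."""
    # P(+) = P(+|D)·P(D) + P(+|not D)·P(not D)
    p_positive = sensitivity * prior + (1 - specificity) * (1 - prior)
    return sensitivity * prior / p_positive

p = posterior(prior=0.01, sensitivity=0.99, specificity=0.95)
# ≈ 0.167 — despite a "99% accurate" test, the posterior is only about 1 in 6,
# because false positives from the healthy 99% swamp the rare true positives
```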
· · ·
P5
Core Theory
Random Variables & Mathematical Expectation
🎯
Random Variable
Mapping Outcomes to Numbers
X: S → ℝ assigns a real number to each sample point. Capital X = the RV (function); lowercase x = the value it takes. Converts non-numeric experiments into numbers for analysis.
💡
Discrete vs Continuous
Two Types of RVs
Discrete: Countable values {0,1,2,…} — described by PMF p(x)
Continuous: Any value in an interval — described by PDF f(x)
CDF F(x) = P(X ≤ x) exists for both types
⚙️
Expectation & Moments
Summary Measures
E(X): Probability-weighted average — the "centre of gravity"
Var(X) = E(X²) − [E(X)]²
rth raw moment: μ'ᵣ = E(Xʳ)
rth central moment: μᵣ = E[(X−μ)ʳ]
Linearity: E(aX+b) = aE(X)+b
🔢
Covariance & Correlation
Between Two RVs
Cov(X,Y) = E(XY) − E(X)·E(Y)
ρ(X,Y) = Cov(X,Y) / (σ_X·σ_Y)
Independent → Cov = 0 (not always vice versa)
E(X) — discrete: Σₓ x · p(x), where Σ p(x) = 1
E(X) — continuous: ∫₋∞^∞ x · f(x) dx, where ∫ f(x) dx = 1
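The discrete expectation and variance formulas can be sketched with a PMF stored as a dict (the fair-die example is illustrative):

```python
def expectation(pmf):
    """E(X) = Σ x·p(x) — the probability-weighted average."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """Var(X) = E(X²) − [E(X)]²."""
    mu = expectation(pmf)
    return sum(x * x * p for x, p in pmf.items()) - mu ** 2

die = {face: 1 / 6 for face in range(1, 7)}   # fair six-sided die
mu = expectation(die)     # 3.5
var = variance(die)       # 35/12 ≈ 2.9167
```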
🪙
Bernoulli(p)
Single Trial, Two Outcomes
One trial, two outcomes: 1 (success) with prob p, 0 (failure) with prob (1−p)
E(X) = p; Var(X) = p(1−p)
Building block for Binomial
🎰
Binomial(n, p)
n Independent Bernoulli Trials
Counts number of successes in n independent trials
P(X=k) = C(n,k)·pᵏ·(1−p)ⁿ⁻ᵏ
E(X) = np; Var(X) = np(1−p)
Use when: fixed n, each trial independent, constant p
☎️
Poisson(λ)
Rare Events in Time/Space
Counts events in a fixed interval (time, area, volume)
P(X=k) = e⁻λ·λᵏ / k!
E(X) = Var(X) = λ — unique equal mean & variance!
Use for: calls/hour, defects/unit, accidents/year
🔢
Geometric(p)
Waiting for First Success
P(X=k) = (1−p)^(k−1)·p where k=1,2,3,…
E(X) = 1/p; Var(X) = (1−p)/p²
Memoryless: P(X>s+t|X>s) = P(X>t)
Use for: number of trials to first success
Binomial PMF: P(X=k) = C(n,k) · pᵏ · (1−p)ⁿ⁻ᵏ
Binomial Mean/Var: E(X) = np ; Var(X) = np(1−p)
Poisson PMF: P(X=k) = e⁻λ · λᵏ / k! (k = 0,1,2,…)
Poisson Mean/Var: E(X) = Var(X) = λ
Geometric PMF: P(X=k) = (1−p)^(k−1) · p
Hypergeometric: P(X=k) = C(K,k)·C(N−K,n−k) / C(N,n)
Binomial(10, 0.3) vs Poisson(3) — PMF Comparison
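The Binomial(10, 0.3) vs Poisson(3) comparison can be reproduced directly from the PMF formulas — both distributions have mean 3:

```python
import math

def binom_pmf(k, n, p):
    """P(X=k) = C(n,k)·pᵏ·(1−p)ⁿ⁻ᵏ."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(X=k) = e⁻λ·λᵏ / k!."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# E(X) = np = 10 × 0.3 = 3 for the binomial, matching Poisson λ = 3
binom_mean = sum(k * binom_pmf(k, 10, 0.3) for k in range(11))
pois_mass = sum(poisson_pmf(k, 3) for k in range(50))   # PMF sums to 1
```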
· · ·
D2
Continuous Distributions
Normal, Exponential, Uniform, Gamma & Beta
🔔
Normal N(μ, σ²)
The Bell Curve — Most Important
Symmetric about mean μ; inflection points at μ±σ
68-95-99.7 rule for 1σ, 2σ, 3σ from mean
Standard Normal Z ~ N(0,1): Z = (X−μ)/σ
Central Limit Theorem: sample means → Normal
⏱️
Exponential(λ)
Time Until First Event
f(x) = λe⁻λˣ for x ≥ 0
E(X) = 1/λ; Var(X) = 1/λ²
Memoryless: P(X>s+t|X>s) = P(X>t)
Continuous analog of geometric distribution
📐
Uniform U(a, b)
Equal Probability Everywhere
f(x) = 1/(b−a) for a ≤ x ≤ b
E(X) = (a+b)/2; Var(X) = (b−a)²/12
All values equally likely in [a, b]
🌀
Gamma & Beta
Flexible Family Distributions
Gamma(α,β): Generalises exponential; waiting time for αth event. E(X)=αβ
Beta(α,β): Defined on [0,1]; used for proportions, probabilities. Very flexible shape.
Normal PDF: f(x) = [1/(σ√(2π))] · exp[−(x−μ)²/(2σ²)]
Standard Normal Z: Z = (X − μ) / σ ~ N(0,1)
Exponential PDF: f(x) = λ·e⁻λˣ , x ≥ 0 ; E(X) = 1/λ
Uniform PDF: f(x) = 1/(b−a) for x ∈ [a,b]
Gamma PDF: f(x) = xᵅ⁻¹·e^(−x/β) / [βᵅ·Γ(α)] , x > 0
Beta PDF: f(x) = xᵅ⁻¹(1−x)ᵝ⁻¹ / B(α,β) , x ∈ [0,1]
Normal Distribution — The 68-95-99.7 Empirical Rule
Which Distribution to Use? Binary single trial → Bernoulli. Counting successes in n independent trials with constant p → Binomial. Rare events in time/space → Poisson. Waiting for first success/event → Geometric/Exponential. Sampling without replacement → Hypergeometric. Heights, errors, averages → Normal. Waiting for the αth event → Gamma. Proportions → Beta.
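The 68-95-99.7 empirical rule can be verified from the standard normal CDF, which Python's `math.erf` gives in closed form:

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X ≤ x) for X ~ N(μ, σ²), via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# probability mass within 1σ, 2σ, 3σ of the mean
within = [normal_cdf(k) - normal_cdf(-k) for k in (1, 2, 3)]
# ≈ [0.6827, 0.9545, 0.9973] — the 68-95-99.7 rule
```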
Y = β₀ + β₁X + ε. We model the linear relationship between a response Y (dependent) and a predictor X (independent), where ε is random error. We estimate β₀ & β₁ from sample data.
⚙️
Model Assumptions
LINE Assumptions
L — Linearity: True relationship is linear in X
I — Independence: Errors εᵢ are independent
N — Normality: Errors ~ N(0, σ²)
E — Equal variance: Var(εᵢ) = σ² (homoscedasticity)
💡
Interpretation
Meaning of Coefficients
β₀ (intercept): Expected value of Y when X = 0
β₁ (slope): Change in E(Y) for each 1-unit increase in X
Sign of β₁ tells direction; magnitude tells strength
✅
Where to Use
Regression Applications
Predicting outcomes (sales, yield, price) from predictors
Quantifying effect size of a predictor on outcome
Controlling for confounders in observational studies
Simple Linear Regression — Fitted Line & Residuals
· · ·
R2
Estimation
OLS Estimation & BLUE Properties
⚙️
OLS Principle
Minimise Sum of Squared Errors
We choose b₀ and b₁ to minimise SSE = Σ(Yᵢ − b₀ − b₁Xᵢ)². Taking partial derivatives and setting to zero gives the normal equations, leading to closed-form solutions.
💡
Gauss-Markov Theorem
BLUE Estimators
Under the LINE assumptions, OLS estimators are Best Linear Unbiased Estimators (BLUE). They have the smallest variance among all linear unbiased estimators. This is the most important theorem in regression.
📊
Variance Decomposition
SST = SSR + SSE
SST: Total sum of squares = Σ(Yᵢ−ȳ)²
SSR: Regression SS = Σ(Ŷᵢ−ȳ)² (explained by model)
SSE: Error SS = Σ(Yᵢ−Ŷᵢ)² (unexplained residual)
PI for new Y: Ŷ ± t · s·√[1 + 1/n + (x*−x̄)²/Sxx] — wider!
💡
CI vs Prediction Interval
Key Distinction
CI for mean E(Y|x*) is narrower — for the average at x*. Prediction interval (PI) is wider — for an individual future observation. PI includes extra uncertainty from ε. Both narrow near x̄, widen as x* moves away.
🔢
F-Test
Overall Model Significance
H₀: β₁ = … = βₖ = 0 (no predictors help)
F = MSR / MSE ~ F(k, n−k−1) under H₀
Equivalent to t-test in simple regression (F = t²)
t-statistic for β₁: t = b₁ / [s / √Sxx] ~ t(n−2)
SE(b₁): SE(b₁) = s / √Sxx where s = √MSE
CI for β₁: b₁ ± t_(α/2, n−2) · SE(b₁)
F for overall model: F = MSR / MSE = (SSR/k) / (SSE/(n−k−1))
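A sketch of OLS estimation plus the t-statistic for β₁ in the simple case (the four (x, y) points are illustrative):

```python
import math

def simple_ols(x, y):
    """OLS fit Ŷ = b0 + b1·x, with SE(b1) and the t-statistic for b1."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((a - xbar) ** 2 for a in x)
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    b1 = sxy / sxx                        # slope
    b0 = ybar - b1 * xbar                 # intercept: line passes through (x̄, ȳ)
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))          # s = √MSE
    se_b1 = s / math.sqrt(sxx)
    return b0, b1, se_b1, (b1 / se_b1 if se_b1 > 0 else float("inf"))

b0, b1, se_b1, t = simple_ols([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
# b1 = 1.94, b0 = 0.15; large t ⇒ slope clearly significant
```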
· · ·
R4
Variance Partitioning
ANOVA Table for Regression
ANOVA Table Structure — Simple Linear Regression
· · ·
R5
Model Checking
Residual Analysis & Diagnostics
🔬
Why Diagnostics?
Checking Model Assumptions
Residuals eᵢ = Yᵢ − Ŷᵢ carry information about assumption violations. Always plot residuals before trusting inference. A good model has residuals that look like random noise.
📊
Key Diagnostic Plots
4 Essential Plots
Residuals vs Fitted (Ŷᵢ): Check linearity & homoscedasticity. Should be random scatter around zero.
Normal Q-Q plot: Check normality of residuals. Points should lie on a straight diagonal line.
Scale-Location plot: √|eᵢ| vs Ŷᵢ — check homoscedasticity.
Residuals vs Leverage: Identify influential points & Cook's D.
💡
Standardised Residuals
Types of Residuals
Ordinary: eᵢ = Yᵢ − Ŷᵢ (raw residuals)
Standardised: rᵢ = eᵢ / (s√(1−hᵢᵢ)) — scale-free; should be within ±2
Studentised deleted: rᵢ* — uses s₍ᵢ₎ without point i — best for outlier detection
VIF (Var. Inflation): VIF_j = 1/(1 − Rj²), where Rj² = R² from regressing Xj on all other predictors
Breusch-Pagan Test: Regress eᵢ² on Xᵢ; test F or nR² ~ χ²(k)
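With only two predictors, Rj² reduces to the squared Pearson correlation between them, so the VIF can be sketched without a full auxiliary regression (the data values are illustrative):

```python
def vif_two(x1, x2):
    """VIF for one of exactly two predictors: VIF = 1/(1 − r²)."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    sxy = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    sxx = sum((a - m1) ** 2 for a in x1)
    syy = sum((b - m2) ** 2 for b in x2)
    r2 = sxy * sxy / (sxx * syy)          # squared Pearson r = Rj² here
    return 1 / (1 - r2)

vif = vif_two([1, 2, 3, 4], [1, 3, 2, 4])   # r = 0.8 ⇒ VIF = 1/0.36 ≈ 2.78
```

As r → 1 the denominator shrinks and the VIF blows up — the variance-inflation effect of multicollinearity.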
· · ·
R7
Outlier Detection
Influential Points, Outliers & Leverage
🎯
Outliers in Y
Large Residuals
A point with a large studentised residual |rᵢ| > 2 or 3. Outliers in Y can inflate MSE and distort regression estimates. Check if real or data error.
🔭
High Leverage Points
Outliers in X Space
Leverage hᵢᵢ (hat matrix diagonal) measures how far Xᵢ is from x̄. Rule of thumb: hᵢᵢ > 2(k+1)/n signals high leverage. High leverage = potential for high influence.
💡
Cook's Distance D
Overall Influence
Cook's D measures the effect of deleting point i on ALL fitted values. D > 1 (or D > 4/n) suggests the point is influential. Combines residual size and leverage: a high-leverage point with large residual is most influential.
🔢
DFFITS & DFBETAS
Change-in-Fit Statistics
DFFITS: Change in Ŷᵢ when point i is deleted (standardised)
DFBETAS: Standardised change in each coefficient bⱼ when point i is deleted
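In the simple-regression case the hat-matrix diagonals have a closed form, so leverage can be sketched directly (the x values are illustrative, with one point far from x̄):

```python
def leverage(x):
    """Hat diagonals for simple regression: hᵢᵢ = 1/n + (xᵢ − x̄)²/Sxx."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((a - xbar) ** 2 for a in x)
    return [1 / n + (a - xbar) ** 2 / sxx for a in x]

h = leverage([1, 2, 3, 10])
# x = 10 sits far from x̄ = 4, so it carries the highest leverage;
# the hᵢᵢ always sum to the number of parameters (here 2: intercept + slope)
```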
Y = β₀ + β₁X₁ + β₂X₂ + … + βₖXₖ + ε. Each βⱼ is the partial effect of Xⱼ on Y, holding all other predictors constant. Estimated by matrix algebra: b = (X'X)⁻¹X'Y.
💡
Adjusted R²
Penalised Fit Measure
R² always increases when adding predictors (even irrelevant ones). Adjusted R² penalises for the number of predictors — use this to compare models with different numbers of predictors.
⚙️
Model Selection
Choosing Predictors
Forward selection: Add predictors one at a time
Backward elimination: Remove least significant predictors
Stepwise: Combine both directions
AIC/BIC: Information criteria — lower is better
Cross-validation: Out-of-sample prediction error
🎯
Logistic Regression
Binary Response Variable
When Y ∈ {0,1}, linear regression is inappropriate. Use logistic regression: log[p/(1−p)] = β₀ + β₁X₁ + …. Coefficients interpreted as log-odds; exp(βⱼ) = odds ratio. Estimated by MLE, not OLS.
Odds Ratio: OR_j = exp(βⱼ) — effect of a 1-unit increase in Xⱼ on the odds of Y=1
OLS vs Logistic: Use OLS regression when Y is continuous (approximately). Use logistic regression when Y is binary (0/1). Never fit a linear regression to a binary outcome — it can predict probabilities outside [0,1] and violates the normality/homoscedasticity assumptions.
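The odds-ratio interpretation can be checked numerically: under the logistic model, moving x up by one unit multiplies the odds by exactly exp(β₁). The coefficients below are hypothetical:

```python
import math

def logistic_prob(b0, b1, x):
    """P(Y=1 | x) = 1 / (1 + e^−(β₀ + β₁x)) — the inverse logit."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

def odds(p):
    """Odds = p / (1 − p)."""
    return p / (1 - p)

b0, b1 = -1.0, 0.5                                  # hypothetical coefficients
or_hat = odds(logistic_prob(b0, b1, 3)) / odds(logistic_prob(b0, b1, 2))
# ratio of odds at x = 3 vs x = 2 equals e^β₁ regardless of where x starts
```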
STAT3203 · Econometrics
Y
STAT3203 · B.Sc. Statistics Year 3 · BRUR
Econometrics
Classical Linear Model · OLS · Multicollinearity · Heteroscedasticity · Autocorrelation · Specification Errors · Dummy Variables · Simultaneous Equations · Time Series
🎓 What is Econometrics?
Econometrics is what happens when statistics and economics go on a date and have a baby called "regression." It asks: "Yes, we think X causes Y in theory — but how strong is that relationship in actual data, and can we prove it?" As Gujarati puts it: "Econometrics is the art and science of using statistical methods to test economic theories and forecast economic phenomena." The joke among economists: "Economists use models to explain what has already happened, and models to predict the future — and the same model is usually wrong in both cases." 😄
Econometrics = Economics + Metrics. It applies statistical and mathematical methods to quantify economic relationships, test economic theories, and forecast future economic activity. Gujarati defines it as the "quantitative analysis of actual economic phenomena."
⚙️
The Three Steps
Econometric Methodology
1. Economic model: Theory says Y depends on X₁, X₂,… (e.g., consumption depends on income)
2. Econometric model: Add error term — Y = f(X₁,X₂) + ε
3. Estimate & test: Use data to estimate parameters and test hypotheses
💡
Real World Example
Keynesian Consumption Function
Theory: Consumption increases with income. Econometric model: C = β₀ + β₁Y + ε, where β₁ is the Marginal Propensity to Consume (MPC) — how much of each extra taka is consumed. We estimate it from real survey data!
😂 Econometrician's Joke: "An economist, a physicist, and an econometrician are stranded on an island with canned food. The physicist says 'let's use a rock to open the cans.' The economist says 'assume we have a can opener.' The econometrician says 'let's regress can-opening on island conditions, correct for heteroscedasticity, and check the instrumental variables.'" — Econometrics solves real problems, just very thoroughly! 😄
· · ·
E2
Foundation
Classical Linear Regression Model (CLRM)
🎯 The CLRM is the Backbone: Every econometrics problem starts by asking: "Which CLRM assumption is violated here?" Like a doctor checking vital signs before treating a patient — you must check the assumptions before trusting the results.
📋
The 10 Assumptions
CLRM Assumptions (Gujarati)
A1: Linear in parameters — model is linear in β (not necessarily in X)
A2: Fixed X values — X is non-stochastic (or fixed in repeated sampling)
A3: Zero mean error — E(εᵢ) = 0
A4: Homoscedasticity — Var(εᵢ) = σ² (constant)
A5: No autocorrelation — Cov(εᵢ, εⱼ) = 0, i≠j
A6: Zero covariance between error and X — Cov(εᵢ, Xᵢ) = 0
A7: n > k — more observations than parameters
A8: Variability in X — Var(X) ≠ 0
A9: No perfect multicollinearity — no exact linear relation among Xs
A10: Normality of ε — εᵢ ~ N(0, σ²)
💡
LINE Simplified
Remember: LINE
Linearity — relationship is linear in parameters
Independence — errors are independent of each other
Normality — errors are normally distributed
Equal variance — errors have constant variance (homoscedastic)
😄 Memory tip: "LINE up your assumptions or your results will be crooked!"
⚠️
What Happens When Violated
Consequences Table
A4 violated (hetero): OLS unbiased but inefficient; wrong SEs
A5 violated (autocorr): OLS unbiased but inefficient; wrong SEs
A9 violated (multicoll): OLS unbiased but very large variance; unreliable estimates
Omitted variable: OLS biased AND inconsistent — the worst!
🌍
Real Scenario
Estimating Wage Equation
Model: Wage = β₀ + β₁Education + β₂Experience + ε. Check: does the error have constant variance? (Workers with more education may have more variable wages → heteroscedasticity.) Are education and experience correlated? (Older workers often have more experience AND education → multicollinearity.) Always diagnose first!
OLS chooses β̂ to minimise SSE = Σeᵢ² = Σ(Yᵢ − Ŷᵢ)². The "squaring" penalises large errors more — like a strict teacher who really hates big mistakes more than small ones! 😄 The solution is unique and closed-form.
🏆
Gauss-Markov Theorem
BLUE — Why OLS is Best
Under assumptions A1–A9 (normality not required), OLS estimators are: Best — minimum variance; Linear — in Y; Unbiased — E(β̂) = β; Estimators. No other linear unbiased estimator has smaller variance! Think of it as OLS being the "most efficient honest statistician."
⚙️
OLS Properties
Algebraic Properties
Σeᵢ = 0 (residuals sum to zero)
Σeᵢ·Xᵢ = 0 (residuals uncorrelated with X)
Regression line passes through (X̄, Ȳ)
Σeᵢ·Ŷᵢ = 0 (residuals uncorrelated with fitted values)
📐
Goodness of Fit
R² and its Limits
R² ∈ [0,1]; R²=1 perfect fit; R²=0 model explains nothing
Warning: High R² ≠ good model! You can have high R² with spurious regression (two random trends)
Adjusted R²: Penalises for extra predictors — use for model comparison
😄 "A high R² in time series is suspicious, not impressive!"
🌍 Real World: Bangladesh rice yield data: Yield = 1200 + 45·Fertiliser + 30·Rain + ε. R² = 0.82 means 82% of the variation in yield is explained by fertiliser and rainfall. β̂₁ = 45 means: holding rain constant, each extra kg of fertiliser per acre increases yield by 45 kg. This directly guides agricultural policy!
· · ·
E4
Problem 1
Multicollinearity — The Identity Crisis
😂 The Multicollinearity Joke: "Multicollinearity is like trying to tell apart identical twins by asking their friends — everyone says 'they're basically the same.' Your model literally cannot figure out who is doing what." When X₁ and X₂ are nearly perfectly correlated, the model gets confused about whose "fault" it is when Y changes.
🔍
What it is
Correlated Predictors
Multicollinearity occurs when two or more predictor variables are highly correlated with each other. Perfect multicollinearity = exact linear relationship (OLS breaks down entirely). Near-perfect = high but not perfect correlation (OLS works but gives unreliable estimates).
⚙️
Detection
How to Detect
Correlation matrix: |rᵢⱼ| > 0.8 between predictors — warning sign
VIF: VIF_j = 1/(1 − Rj²); VIF > 10 signals serious multicollinearity
🌍 Bangladesh Example: Regressing household expenditure on income and wealth. Income and wealth are highly correlated (r=0.92). VIF comes out at 8.2. The model can't tell apart the separate effects of income vs wealth. Solution: use only income, or create a composite "socioeconomic status" score. 😄 Tip: "If two variables always go up together in your data, your model has the same problem as a detective who always finds two suspects at the crime scene at the same time — it cannot tell who did it."
· · ·
E5
Problem 2
Heteroscedasticity — The Unequal Spreader
😄 Analogy: "Heteroscedasticity is like a group of students whose test scores vary wildly for rich students (some study hard, some don't) but are very consistent for poor students (all must study). The variance of the 'error' in predicting scores is not equal across income groups." This violates A4!
📡
What it is
Non-constant Error Variance
Heteroscedasticity means Var(εᵢ) = σᵢ² — the variance of the error term is NOT constant across observations. It changes with one or more predictors. Very common in cross-sectional data (individuals, firms, countries with very different sizes).
WLS objective: Minimise Σ wᵢeᵢ², where wᵢ = 1/σᵢ² (higher weight = more precise observation)
Breusch-Pagan: BP = n·R² ~ χ²(k), from regressing eᵢ²/σ̂² on all X's
White test stat: n·R² ~ χ²(p), where p = number of regressors in the auxiliary regression
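The nR² idea behind the Breusch-Pagan test can be sketched for a single regressor: regress the squared residuals on x and compute n times the auxiliary R². The residuals below are constructed (illustrative) so that eᵢ² grows exactly linearly with x:

```python
def bp_stat(x, resid):
    """Breusch-Pagan sketch: BP = n·R² from the auxiliary regression of eᵢ² on x."""
    e2 = [e * e for e in resid]
    n = len(x)
    xbar, ybar = sum(x) / n, sum(e2) / n
    sxx = sum((a - xbar) ** 2 for a in x)
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, e2))
    syy = sum((b - ybar) ** 2 for b in e2)
    r2 = (sxy * sxy) / (sxx * syy) if sxx * syy > 0 else 0.0
    return n * r2   # compare against χ²(k)

# squared residuals [1, 2, 3, 4] are perfectly linear in x ⇒ R² = 1 ⇒ BP = n = 4
bp = bp_stat([1, 2, 3, 4], [1.0, 2 ** 0.5, 3 ** 0.5, 2.0])
```

A large BP relative to the χ² critical value signals heteroscedasticity.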
🌍 Real Example: Regressing household food expenditure on income across 1000 Bangladeshi families. Rich families have very variable food spending (some eat lavishly, some save); poor families all spend similarly near subsistence. This creates a fan shape in residuals — classic heteroscedasticity. Remedy: use ln(expenditure) or WLS with weight 1/income².
· · ·
E6
Problem 3
Autocorrelation — The Time Traveller's Problem
😄 The Autocorrelation Joke: "Autocorrelation is like a gossip chain. What happened yesterday affects what people say today, which affects tomorrow. Errors in time series data are like rumours — yesterday's error whispers to today's error." When today's residual tells tomorrow's what to be, you have autocorrelation!
🔗
What it is
Correlated Error Terms
Autocorrelation (serial correlation) means Cov(εᵢ, εⱼ) ≠ 0 for i≠j — a violation of assumption A5. Most common in time series data (monthly GDP, daily stock prices, annual inflation). Positive autocorrelation is most common — errors persist in the same direction.
⚙️
Detection
Tests for Autocorrelation
Plot residuals over time: Look for cyclical or trending patterns
Durbin-Watson (DW) test: d ≈ 2 → no autocorrelation; d < 1.5 → positive AC; d > 2.5 → negative AC
Breusch-Godfrey (BG) test: More general — detects higher-order autocorrelation
Run test: Non-parametric test for randomness in residuals
⚠️
Consequences
What Goes Wrong
OLS estimates remain unbiased and consistent
But NOT BLUE — inefficient; larger variances than GLS
s² underestimates σ² → t & F tests give inflated significance
R² is overestimated — model looks better than it is!
💡
Remedies
Fixing Autocorrelation
Generalised Least Squares (GLS): Use the transformed model (most correct)
Cochrane-Orcutt method: Iterative GLS for AR(1) errors
Include lagged Y (Yₜ₋₁): Often removes autocorrelation
Newey-West HAC SEs: Robust SEs that account for autocorrelation
First-differencing: Use ΔY = Yₜ − Yₜ₋₁ as the dependent variable
AR(1) error process: εₜ = ρεₜ₋₁ + uₜ where |ρ| < 1 and uₜ ~ WN(0, σ²)
🌍 Bangladesh Example: Regressing annual rice production on fertiliser use and rainfall (1980–2023). The DW statistic = 1.12 signals positive autocorrelation — a good crop year tends to be followed by another good year (farmers reinvest; soil quality persists). Cochrane-Orcutt iteration gives ρ̂ = 0.48, and the corrected model gives more reliable coefficient estimates.
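A minimal numpy sketch of the Durbin-Watson diagnostic (simulated data with AR(1) errors; ρ = 0.7, the trend, and the seed are illustrative assumptions, not the rice-production series above):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = np.linspace(0, 10, n)

# Generate AR(1) errors: eps_t = 0.7*eps_{t-1} + u_t
u = rng.normal(0, 1, n)
eps = np.zeros(n)
for t in range(1, n):
    eps[t] = 0.7 * eps[t - 1] + u[t]
y = 1.0 + 0.5 * x + eps

# OLS fit, then Durbin-Watson statistic on the residuals
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta
dw = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)   # d ≈ 2(1 - rho_hat)
rho_hat = 1 - dw / 2                             # implied AR(1) coefficient
print(dw, rho_hat)
```

With ρ = 0.7 the DW statistic lands well below 1.5, flagging positive autocorrelation — the same pattern the rice-production example describes; ρ̂ = 1 − d/2 is the starting estimate Cochrane-Orcutt would iterate on.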
· · ·
E7
Model Misspecification
Specification Errors — Building the Wrong House
🏗️
What it is
Using the Wrong Model
Specification errors arise when the model is incorrectly specified — wrong variables, wrong functional form, or wrong structural assumptions. The most dangerous error in econometrics!
⚠️
Type 1: Omitted Variable
Leaving Out a Key Variable
True model: Y = β₁ + β₂X₂ + β₃X₃ + ε
Estimated model: Y = α₁ + α₂X₂ + u (X₃ omitted)
Result: OLS estimator of β₂ is biased and inconsistent
Bias direction depends on correlation between X₂ and X₃
😄 "Like measuring height but ignoring whether you're on a slope!"
⚠️
Type 2: Irrelevant Variable
Including an Unnecessary Variable
True model: Y = β₁ + β₂X₂ + ε
Estimated model: Y = α₁ + α₂X₂ + α₃X₃ + u (X₃ is irrelevant)
Result: OLS estimators remain unbiased but inefficient (larger variance)
R² increases artificially — use adjusted R² instead!
⚙️
Type 3: Wrong Functional Form
Linear When Non-linear
True: Y = β₁ + β₂X + β₃X² + ε (quadratic)
Fitted: Y = α₁ + α₂X + u (linear)
Residuals will show a curved pattern
RESET test (Ramsey) detects wrong functional form
💡
Detecting Specification Errors
Tests
RESET test: Add Ŷ², Ŷ³ to model; test their joint significance
Davidson-MacKinnon J-test: Test between non-nested models
RESET test: add Ŷ², Ŷ³ to the regression; F-test on their coefficients. Reject H₀ → misspecification.
🌍 Classic Example: Wage regression omitting "ability." Model: Wage = β₀ + β₁Education + ε. Problem: Ability affects both wages AND education choices. Omitting ability biases β̂₁ upward — we attribute to education some of what is really due to innate ability. This is the classic "ability bias" in returns to education. Solution: use IQ scores, sibling fixed effects, or instrumental variables (Angrist & Krueger's famous quarter-of-birth IV).
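The RESET procedure can be sketched in a few lines of numpy (simulated data; the quadratic true model, coefficients, and seed are illustrative assumptions): fit the misspecified linear model, augment it with Ŷ² and Ŷ³, and F-test the added terms.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
x = rng.uniform(0, 5, n)
y = 1 + 2 * x + 1.5 * x**2 + rng.normal(0, 1, n)   # true model is quadratic

def ols_rss(X, y):
    """OLS fit; return coefficients and residual sum of squares."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b, np.sum((y - X @ b) ** 2)

X1 = np.column_stack([np.ones(n), x])               # misspecified linear fit
b1, rss1 = ols_rss(X1, y)
yhat = X1 @ b1

# RESET: augment with powers of the fitted values
X2 = np.column_stack([X1, yhat**2, yhat**3])
_, rss2 = ols_rss(X2, y)

q = 2                                               # number of added terms
F = ((rss1 - rss2) / q) / (rss2 / (n - X2.shape[1]))
print(F)
```

When the true relationship is quadratic but the fit is linear, the added Ŷ-power terms soak up the curvature, so the F statistic blows up far past the 5% critical value of F(2, 296) ≈ 3.0 and H₀ (correct specification) is rejected.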
· · ·
E8
Qualitative Predictors
Dummy Variables — Turning Categories into Numbers
💡 What is a Dummy Variable? A dummy (indicator) variable takes values 0 or 1 to represent a categorical characteristic. Male = 1, Female = 0. Urban = 1, Rural = 0. It's called "dummy" because it's a stand-in number for something that isn't naturally numeric. 😄 "It's not that the variable is stupid — it's just pretending to be a number!"
🔢
What it is
Binary Indicator Variables
For a qualitative variable with m categories, we include m−1 dummy variables (omit one — the "base" or "reference" category). Including all m dummies causes perfect multicollinearity — the dummy variable trap!
Interaction model: Wageᵢ = β₀ + β₁Dᵢ + β₂Educᵢ + β₃(Dᵢ·Educᵢ) + εᵢ, with D = 1 for male
β₃ allows the slope of education to differ by gender
Male return to education: β₂ + β₃
Female return to education: β₂
This is the Chow test idea — testing if two groups have different regression relationships
⚠️
Dummy Trap
The Most Common Mistake!
For m categories, ALWAYS include m−1 dummies. If you include all m, the sum of all dummies = 1 (a constant) which creates PERFECT multicollinearity. Example: if you have MALE and FEMALE dummies, they always sum to 1 = the intercept column → perfect collinearity. Drop one! The dropped category is the "reference group."
General form: Yᵢ = β₀ + β₁Dᵢ + β₂Xᵢ + εᵢ (D = 1 for group A, D = 0 for group B)
Group A mean: E(Yᵢ|Dᵢ=1, Xᵢ) = (β₀+β₁) + β₂Xᵢ (shifted intercept)
Group B mean: E(Yᵢ|Dᵢ=0, Xᵢ) = β₀ + β₂Xᵢ (reference group)
Chow test F-stat: F = [(SSEᵣ − (SSE₁+SSE₂))/k] / [(SSE₁+SSE₂)/(n₁+n₂−2k)]
🌍 Bangladesh Policy Example: Evaluating the impact of a microfinance program: Treatment = 1 (received loan), Control = 0. Model: Income = β₀ + β₁·Treatment + β₂·Education + β₃·Age + ε. β₁ estimates the Average Treatment Effect (ATE) — did the loan raise income? If β₁ = 2500 (significant), the program raises income by Tk 2500 holding other factors fixed. This is the basis of impact evaluation / program evaluation in development economics!
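A minimal numpy sketch of a dummy-variable regression and the dummy trap (simulated data; the true effect of 2500, the other coefficients, and the seed are illustrative assumptions echoing the example above):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
treat = rng.integers(0, 2, n)                 # D = 1 treated, 0 control
educ = rng.uniform(5, 16, n)
income = 10000 + 2500 * treat + 800 * educ + rng.normal(0, 1000, n)

# Correct design: intercept + ONE dummy (m-1 dummies for m=2 categories)
X = np.column_stack([np.ones(n), treat, educ])
beta, *_ = np.linalg.lstsq(X, income, rcond=None)
print(beta[1])                                # estimated treatment effect

# The dummy trap: including BOTH group dummies with an intercept
# makes the columns linearly dependent (treat + (1-treat) = 1)
X_trap = np.column_stack([np.ones(n), treat, 1 - treat, educ])
print(np.linalg.matrix_rank(X_trap))          # rank 3 < 4 columns
```

The rank deficiency in `X_trap` is exactly the perfect multicollinearity the trap warning describes — X'X is singular and OLS has no unique solution, which is why one dummy must be dropped as the reference group.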
· · ·
E9
Advanced
Simultaneous Equation Models — Cause and Effect in Both Directions
🔄
What it is
Bidirectional Causality
In many economic situations, variables determine each other simultaneously. Supply & demand: price determines quantity demanded AND quantity supplied determines price. This simultaneity causes OLS to be biased and inconsistent — the "simultaneity bias."
⚙️
Endogenous vs Exogenous
Variable Classification
Endogenous (jointly determined): Price & Quantity in supply-demand system
Structural form: The economic behavioural equations
Reduced form: Each endogenous variable expressed only in terms of exogenous variables
💡
Identification Problem
Can We Estimate the Equations?
Under-identified: Cannot estimate from data alone
Exactly identified: Unique estimates possible
Over-identified: Multiple estimates possible; use 2SLS
Order condition: (K−k) ≥ (m−1) where K=total exogenous, k=exogenous in equation, m=endogenous in equation
🔢
Estimation Methods
How to Estimate
ILS (Indirect Least Squares): For exactly identified equations
2SLS (Two-Stage Least Squares): Most popular for over-identified. Stage 1: regress endogenous X on instruments; Stage 2: use fitted X̂ in main regression
2SLS Stage 1: Regress P on ALL exogenous variables → get P̂
2SLS Stage 2: Replace P with P̂ in the structural equation → OLS gives consistent estimates
😄 Why OLS Fails Here: "Using OLS for a simultaneous system is like trying to figure out who started a fight when both parties hit each other at exactly the same time — you can't tell cause from effect!" Price rises → quantity supplied rises (supply); but quantity demanded falls → price falls (demand). OLS blends these two directions and gives wrong answers for both. 2SLS untangles them using instruments.
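The two 2SLS stages can be sketched with a toy endogenous regressor (simulated data; the structural coefficient of 2, the instrument strength, and the seed are illustrative assumptions, not a full supply-demand system):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000
z = rng.normal(0, 1, n)                        # instrument (exogenous)
u = rng.normal(0, 1, n)                        # structural error
x = z + 0.8 * u + rng.normal(0, 1, n)          # endogenous: Cov(x, u) > 0
y = 2.0 * x + u                                # true beta = 2

# OLS (no intercept, data are mean-zero): biased upward by the x-u correlation
b_ols = np.sum(x * y) / np.sum(x * x)

# Stage 1: regress x on the instrument z -> fitted x_hat
xhat = z * (np.sum(z * x) / np.sum(z * z))
# Stage 2: regress y on x_hat (algebraically equal to the IV estimator
# sum(z*y)/sum(z*x))
b_2sls = np.sum(xhat * y) / np.sum(xhat * xhat)
print(b_ols, b_2sls)
```

On this design OLS converges to roughly 2.3 (simultaneity bias), while 2SLS recovers the true coefficient of 2 — the instrument isolates the variation in x that is uncorrelated with u.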
· · ·
E10
Time Series
Time Series Econometrics — Stationarity, Unit Roots & Cointegration
⚡ The Spurious Regression Warning! Regressing one non-stationary time series on another can give a high R² and significant t-statistics PURELY BY CHANCE — even if they have nothing to do with each other. Example: Bangladesh rice production and global smartphone sales both trend upward → regressing one on the other gives R²=0.94 but it is COMPLETELY MEANINGLESS. Always test for stationarity first!
📈
Stationarity
The Key Concept in Time Series
A time series is weakly stationary if its mean, variance, and autocovariances are constant over time (don't depend on t). Most economic time series (GDP, prices, exchange rates) are NON-stationary — they have trends and drifts.
⚙️
Unit Root Tests
Testing for Non-stationarity
Augmented Dickey-Fuller (ADF) test: H₀: series has unit root (non-stationary); Reject H₀ → stationary. The most widely used test.
Phillips-Perron (PP) test: Non-parametric correction for serial correlation
KPSS test: H₀: stationary (opposite hypothesis — use alongside ADF)
💡
Cointegration
Long-Run Equilibrium
Two non-stationary I(1) series are cointegrated if their linear combination is stationary I(0). They share a long-run equilibrium relationship. Use Engle-Granger two-step method or Johansen test. If cointegrated: use Error Correction Model (ECM).
🌍 Bangladesh Application: Testing whether the taka-dollar exchange rate and domestic price level are cointegrated (Purchasing Power Parity). Both series are I(1). Engle-Granger test finds cointegration — a long-run PPP relationship holds. Estimate ECM: the speed-of-adjustment coefficient γ̂ = −0.23 means 23% of any deviation from long-run PPP is corrected each quarter. Highly useful for monetary policy!
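The spurious-regression warning above is easy to demonstrate: regress two independent random walks on each other in levels, then in first differences (simulated data; lengths and seed are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1000
walk1 = np.cumsum(rng.normal(0, 1, n))   # I(1): pure random walk
walk2 = np.cumsum(rng.normal(0, 1, n))   # an INDEPENDENT random walk

def r2(x, y):
    """R-squared from OLS of y on a constant and x."""
    X = np.column_stack([np.ones(len(x)), x])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    return 1 - np.sum(resid**2) / np.sum((y - y.mean()) ** 2)

r2_level = r2(walk1, walk2)              # can be deceptively high by chance
r2_diff = r2(np.diff(walk1), np.diff(walk2))  # on stationary differences
print(r2_level, r2_diff)
```

In first differences both series are stationary white noise and R² collapses toward zero, exposing the levels relationship as spurious — this is why testing for unit roots (and differencing, or modelling cointegration) must come before regression on trending series.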
🎓 Why Multivariate Analysis?
"In real life, nothing happens in isolation." Blood pressure AND cholesterol AND BMI together predict heart disease — not one alone. Multivariate analysis handles p variables simultaneously, capturing their joint distributions, correlations, and interactions. As Johnson & Wichern put it: "Most data sets encountered in practice contain measurements on several variables that must be analyzed jointly." The key advantage: we preserve the covariance structure that gets lost when analyzing variables one at a time. 😄 Joke: "A univariate statistician sees a forest of trees. A multivariate statistician sees the forest, the ecosystem, the relationships between trees, AND the soil composition — all at once!"
Multivariate Analysis (MVA) refers to statistical techniques for analysing data with p ≥ 2 variables measured on each observation. Goal: understand the joint behaviour, interdependencies, and structure of these variables simultaneously — not one at a time.
✅
Applications
Where MVA is Used
Medical: Joint analysis of blood pressure, cholesterol, BMI, age for heart disease risk
Ecology: Species abundance across multiple environmental variables
Finance: Portfolio of stocks — returns, risks, correlations simultaneously
Agriculture: Crop yield as function of soil, rain, temperature, fertiliser jointly
💡
Key Concept
The Data Matrix
MVA operates on an n × p data matrix X: n observations (rows), p variables (columns). Each row is a p-dimensional observation vector xᵢ = (x_{i1}, x_{i2}, …, x_{ip})'. The entire dataset is the matrix X of dimension n×p.
⚠️
When NOT to Use
Limitations & Cautions
Requires multivariate normality for many classical methods — always check!
Highly sensitive to outliers — a single bad row can distort everything
Sample size n must be >> p (as a rule: n ≥ 5p minimum)
Interpretation becomes very challenging as p grows large ("curse of dimensionality")
😄 The "Curse of Dimensionality" Joke: "In 1D you need 10 points to understand a distribution. In 10D you need 10¹⁰ points — more than the world's population. This is why every multivariate statistician is simultaneously excited about p variables and terrified of having too many." — The curse is real, and MVA is largely about fighting it!
· · ·
M2
Distance Measures
Euclidean & Statistical Distance
📏
Euclidean Distance
Ordinary Geometric Distance
The familiar straight-line distance between two points x and y in p-dimensional space: d(x,y) = √[Σᵢ(xᵢ−yᵢ)²]. Simple but has a critical flaw: it treats all variables equally regardless of their scale or correlation. A variable measured in kilometres swamps one measured in centimetres!
🎯
Mahalanobis Distance
Statistical Distance — The MVP
Mahalanobis distance accounts for the scale AND correlation structure of the data via the covariance matrix Σ: d²(x,μ) = (x−μ)'Σ⁻¹(x−μ). It's unit-free and correlation-corrected. Think of it as Euclidean distance in "standardised space" rotated to remove correlations.
💡
Why Mahalanobis?
Advantages Over Euclidean
Scale-invariant — variables on different units treated fairly
Accounts for correlations — correlated variables don't double-count
Identifies multivariate outliers — points far from the centroid in σ units
d²(x,μ) ~ χ²(p) under multivariate normality — useful for outlier detection!
🌍
Real Example
Medical Diagnosis
Patient has systolic BP=140mmHg and age=45 years. Euclidean distance from population mean (120mmHg, 40yrs) = √(20²+5²) = 20.6. But BP and age have different scales AND are correlated. Mahalanobis distance gives a meaningful "how unusual is this patient" measure corrected for both scale and the BP-age correlation.
Sample version: d²(xᵢ, x̄) = (xᵢ−x̄)' S⁻¹ (xᵢ−x̄) ~ χ²(p) approximately
Outlier threshold: flag xᵢ as an outlier if d²(xᵢ, x̄) > χ²₀.₉₇₅(p)
😄 Distance Analogy: "Euclidean distance measures 'as the crow flies.' Mahalanobis distance measures 'as the statistician walks' — taking into account the terrain (correlations) and the different scales of measurement (variances). They're both right, but one is much smarter about context."
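The BP/age patient example can be computed directly. A minimal numpy sketch — the covariance matrix below is an illustrative assumption (variances and BP-age covariance invented for the demo), only the means (120, 40) and the patient (140, 45) come from the example:

```python
import numpy as np

mu = np.array([120.0, 40.0])             # population mean: BP (mmHg), age (yrs)
Sigma = np.array([[225.0,  90.0],        # assumed covariance: Var(BP)=225,
                  [ 90.0, 100.0]])       # Var(age)=100, Cov(BP, age)=90
x = np.array([140.0, 45.0])              # the patient

d_euclid = np.linalg.norm(x - mu)                      # sqrt(20^2 + 5^2)
d2_mahal = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)  # (x-mu)' S^-1 (x-mu)
print(d_euclid, d2_mahal)
```

Euclidean distance is about 20.6 (dominated by the BP scale), while the squared Mahalanobis distance is far below the outlier threshold χ²₀.₉₇₅(2) ≈ 7.38 — under this assumed covariance the patient is not unusual once scale and correlation are accounted for.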
Every symmetric positive definite matrix A can be decomposed as: A = PΛP' where P = matrix of eigenvectors (orthonormal columns) and Λ = diagonal matrix of eigenvalues λ₁ ≥ λ₂ ≥ … ≥ λₚ > 0. The eigenvectors give the "principal directions" of the data; eigenvalues give the "lengths" in those directions. Foundation of PCA!
🔺
Cholesky Decomposition
Lower-Triangular Factorisation
Every positive definite matrix Σ can be written as Σ = LL' where L is a lower-triangular matrix with positive diagonal entries. Why useful? (1) Simulate multivariate normal data: if Z~N(0,I), then X = μ + LZ ~ N(μ,Σ). (2) Solve linear systems efficiently. (3) Check positive definiteness — Cholesky fails if Σ is not positive definite.
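Use (1) above — simulating multivariate normal data — can be sketched in numpy (the mean vector, covariance matrix, sample size, and seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(7)
mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

L = np.linalg.cholesky(Sigma)        # lower-triangular L with Sigma = L L'
Z = rng.normal(size=(100_000, 2))    # Z ~ N(0, I)
X = mu + Z @ L.T                     # X = mu + L Z ~ N(mu, Sigma)

S = np.cov(X, rowvar=False)          # sample covariance recovers Sigma
print(S)
```

`np.linalg.cholesky` raises `LinAlgError` if the matrix is not positive definite — which is exactly use (3), checking positive definiteness in practice.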
💡
Square Root of Matrix
Matrix Square Root A^(1/2)
Using spectral decomposition: A^(1/2) = PΛ^(1/2)P' where Λ^(1/2) = diag(√λ₁, …, √λₚ). Property: A^(1/2) · A^(1/2) = A. Used to transform data to uncorrelated form: if X ~ Nₚ(μ,Σ), then Σ^(-1/2)(X−μ) ~ Nₚ(0,I) — the "sphering" or "whitening" transformation essential for many multivariate tests.
🔢
Partitioned Covariance
Block Structure of Σ
Partition the p-vector x = (x₍₁₎', x₍₂₎')' into groups of p₁ and p₂ variables. Then Σ = [[Σ₁₁, Σ₁₂],[Σ₂₁, Σ₂₂]] where Σ₁₁=Var(x₍₁₎), Σ₂₂=Var(x₍₂₎), Σ₁₂=Cov(x₍₁₎,x₍₂₎). Used in canonical correlation, conditional distributions, and regression of one group on another.
😄 Matrix Square Root Joke: "Why can't a matrix go to therapy alone? Because it needs its square root to become 'whole' — and its inverse to undo its past mistakes!" More seriously: the matrix square root is what lets us transform any multivariate normal distribution into a standard one, making everything else tractable.
· · ·
M4
Variation in p Dimensions
Covariance Matrix & Generalised Variance
📊
Covariance Matrix Σ
The Multivariate Analogue of Variance
For a p-dimensional random vector X, the covariance matrix Σ (p×p) captures ALL pairwise variances and covariances: σᵢᵢ = Var(Xᵢ) on diagonal; σᵢⱼ = Cov(Xᵢ,Xⱼ) off-diagonal. Σ is symmetric and positive (semi)definite. The sample version S = (n−1)⁻¹Σᵢ(xᵢ−x̄)(xᵢ−x̄)' is the unbiased estimator.
🔢
Generalised Variance
|Σ| — One Number for All Variation
The determinant |Σ| is called the generalised variance — it summarises the total variation in all p variables in a single number. Geometrically: |Σ| is proportional to the squared volume of the p-dimensional ellipsoid formed by the data. |Σ| = 0 means variables are perfectly linearly dependent (degenerate distribution).
💡
Total Variation
Trace of Σ — Alternative Summary
tr(Σ) = σ₁₁ + σ₂₂ + … + σₚₚ = sum of all variances. This is the "total variance" measure. tr(Σ) = Σλᵢ (sum of eigenvalues). Used in PCA: proportion of variance explained by kth PC = λₖ/tr(Σ). Both |Σ| and tr(Σ) are used as scalar measures of multivariate scatter.
🌍
Correlation Matrix
Standardised Version
R = D^(-1/2) Σ D^(-1/2) where D = diag(σ₁₁,…,σₚₚ). All diagonal entries of R = 1; off-diagonal rᵢⱼ ∈ [−1,1]. Working with R (instead of Σ) is equivalent to standardising all variables to unit variance. Most MVA methods can work with either Σ or R — the choice matters for interpretation!
Population Σ: Σ = E[(X−μ)(X−μ)'] (p×p symmetric positive definite)
Sample S: S = (1/(n−1)) · Σᵢ(xᵢ−x̄)(xᵢ−x̄)'
Generalised variance: |S| = det(S) (volume of the data ellipsoid)
Total variance: tr(S) = s₁₁ + s₂₂ + … + sₚₚ = Σᵢ λᵢ
Correlation matrix: R = D^(−1/2) S D^(−1/2) (D = diagonal matrix of variances)
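The formulas above can be checked on a small simulated data matrix (n, p, and the seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(50, 3))              # n=50 observations, p=3 variables

S = np.cov(X, rowvar=False)               # unbiased sample covariance (n-1)
gen_var = np.linalg.det(S)                # generalised variance |S|
total_var = np.trace(S)                   # total variance tr(S)

# Correlation matrix R = D^(-1/2) S D^(-1/2)
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(S)))
R = D_inv_sqrt @ S @ D_inv_sqrt
print(gen_var, total_var)
```

Two of the stated identities fall out immediately: tr(S) equals the sum of the eigenvalues of S, and every diagonal entry of R is exactly 1.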
· · ·
M5
Core Distribution
The Multivariate Normal Distribution
🔔
Definition & Meaning
Nₚ(μ, Σ)
A p-dimensional random vector X follows a multivariate normal distribution Nₚ(μ,Σ) if every linear combination a'X is (univariate) normal for any non-zero vector a. Parameters: mean vector μ (p×1) — location; covariance matrix Σ (p×p) — shape and spread. The MVN is completely characterised by just these two parameters!
📐
Properties
Key Properties of MVN
Marginals are normal: Each Xᵢ ~ N(μᵢ, σᵢᵢ)
Conditionals are normal: (X₁|X₂=x₂) ~ N(μ₁.₂, Σ₁₁.₂)
Linear combinations: AX+b ~ N(Aμ+b, AΣA')
Uncorrelated → Independent: UNIQUE to MVN! If Cov(Xᵢ,Xⱼ)=0 then Xᵢ⊥Xⱼ
Quadratic forms: (X−μ)'Σ⁻¹(X−μ) ~ χ²(p)
💡
Contours & Geometry
Elliptical Contours
Contours of constant density for MVN are ellipsoids in p-dimensional space: {x : (x−μ)'Σ⁻¹(x−μ) = c²}. The shape/orientation is determined by Σ. Axes of the ellipse = eigenvectors of Σ; lengths proportional to √λᵢ. In 2D: a tilted ellipse if variables are correlated, circles if uncorrelated.
⚠️
Important Caution
Marginals Normal ≠ Joint Normal
Each variable being normally distributed does NOT imply joint multivariate normality! A classic counterexample: X~N(0,1) and Y = X if |X|>1, Y = −X otherwise. Then X~N, Y~N but (X,Y) is NOT bivariate normal. Always test joint normality, not just marginals!
Bivariate Normal Contours — Different Correlation Structures
· · ·
M6
Estimation
MLE of Mean Vector & Covariance Matrix
🎯
MLE of μ
Sample Mean Vector
The MLE of the mean vector μ is simply the sample mean vector x̄ = (1/n)Σᵢxᵢ. It is unbiased (E(x̄) = μ) and its sampling distribution is x̄ ~ Nₚ(μ, Σ/n). Larger n → smaller variance of x̄ → more precise estimate. Intuition: just average each variable separately.
⚙️
MLE of Σ
MLE vs Unbiased Estimator
MLE: Σ̂ = (1/n)Σᵢ(xᵢ−x̄)(xᵢ−x̄)' — biased (uses n, not n−1)
Unbiased S: S = (1/(n−1))Σᵢ(xᵢ−x̄)(xᵢ−x̄)' — used in practice
MLE is biased by factor (n−1)/n — for large n, difference negligible
Both are consistent estimators (converge to Σ as n→∞)
💡
Sufficiency
Sufficient Statistics for MVN
For MVN data, (x̄, S) is a jointly sufficient statistic for (μ, Σ) — meaning all information in the sample about the parameters is captured by the sample mean vector and sample covariance matrix. No other summary can add more information. This is the multivariate analogue of the fact that (x̄, s²) is sufficient for (μ,σ²) in univariate normal.
📈
Large Sample Behaviour
Asymptotic Results
√n(x̄ − μ) → Nₚ(0, Σ) as n→∞ (multivariate CLT)
n·(x̄−μ)'S⁻¹(x̄−μ) → χ²(p) as n→∞
S → Σ in probability (consistency)
These are the basis for large-sample inference about μ
MLE of μ: μ̂ = x̄ = (1/n) Σᵢ xᵢ (unbiased)
MLE of Σ (biased): Σ̂ = (1/n) Σᵢ(xᵢ−x̄)(xᵢ−x̄)'
Unbiased S: S = (1/(n−1)) Σᵢ(xᵢ−x̄)(xᵢ−x̄)' (used in tests)
Distribution of x̄: x̄ ~ Nₚ(μ, Σ/n)
Multivariate CLT: √n(x̄ − μ) →_d Nₚ(0, Σ) as n→∞
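The exact (n−1)/n relationship between the two estimators of Σ is easy to verify numerically (simulated data; n, p, and seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(9)
X = rng.normal(size=(20, 2))
n = X.shape[0]

xbar = X.mean(axis=0)                         # MLE of mu
centered = X - xbar
S_mle = centered.T @ centered / n             # MLE of Sigma (biased)
S_unb = centered.T @ centered / (n - 1)       # unbiased S

# The two differ by exactly the factor (n-1)/n
print(S_mle / S_unb)
```

For n = 20 the bias factor is 19/20 = 0.95 — already close to 1, illustrating why the difference is negligible for large n.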
· · ·
M7
Diagnostics
Assessing Multivariate Normality
🔬
Step 1: Marginal Checks
Univariate Marginal Normality
Plot histogram and Q-Q plot for each variable separately
Shapiro-Wilk or Kolmogorov-Smirnov test for each Xⱼ
Check for skewness and kurtosis near 0 and 3 respectively
Warning: All marginals normal ≠ joint MVN! This is necessary but NOT sufficient
😄 Transformation Tip: "Transforming data to normality is like ironing a wrinkled shirt — the content (information) doesn't change, but the shape becomes much more manageable. The Box-Cox transformation is like an automatic iron that figures out the right temperature (λ) by itself!" Remember to always report which transformation was used so results can be back-transformed for interpretation.
· · ·
M8
Sampling Theory
Wishart Distribution & Sampling Distributions
📐
Wishart Distribution
Multivariate Analogue of χ²
If X₁,…,Xₙ are iid Nₚ(0,Σ), then the matrix W = Σᵢ XᵢXᵢ' ~ Wₚ(n,Σ) follows a Wishart distribution with n degrees of freedom and scale matrix Σ. The sample covariance matrix satisfies: (n−1)S ~ Wₚ(n−1,Σ). It is the matrix generalisation of the chi-square distribution — just as s² has a chi-square distribution in univariate normal, S has a Wishart distribution!
⚙️
Properties of Wishart
Key Facts
E(W) = nΣ — so E(S) = Σ (unbiased)
If p=1: W reduces to σ²χ²(n) — the familiar univariate result
🎯
Hotelling's T²
One-Sample Test on the Mean Vector
Tests H₀: μ = μ₀ (the mean vector equals a specified vector). Hotelling's T² = n(x̄−μ₀)'S⁻¹(x̄−μ₀). This is the multivariate generalisation of the one-sample t-test. Under H₀: [(n−p)/p(n−1)]·T² ~ Fₚ,ₙ₋ₚ. Reject H₀ if this exceeds F_α(p, n−p). The TWO-SAMPLE version tests H₀: μ₁ = μ₂ using the pooled covariance matrix.
📊
MANOVA
Multivariate ANOVA
MANOVA tests whether group mean vectors are equal: H₀: μ₁=μ₂=…=μg. Decomposes the total scatter matrix T into: T = H + E where H=between-group (hypothesis) matrix and E=within-group (error) matrix. Tests use functions of H and E — primarily Wilks' Lambda Λ = |E|/|H+E|.
💡
MANOVA Test Statistics
Four Equivalent Tests
Wilks' Lambda: Λ = |E|/|T| — most widely used
Pillai's Trace: tr(H(H+E)⁻¹)
Hotelling-Lawley Trace: tr(HE⁻¹)
Roy's Largest Root: λ₁/(1+λ₁) — most powerful for single-direction alternatives
All four equivalent in large samples; differ for small n or specific alternatives
⚠️
MANOVA Assumptions
Requirements
Multivariate normality within each group
Homogeneity of covariance matrices: Σ₁=Σ₂=…=Σg (Box's M test)
Independence of observations
n > p (more obs than variables — essential!)
⚠ If assumptions violated → use permutation MANOVA (vegan package)
Hotelling T²: T² = n(x̄−μ₀)'S⁻¹(x̄−μ₀)
T² to F: F = [(n−p)/p(n−1)] · T² ~ Fₚ,ₙ₋ₚ under H₀
MANOVA decomposition: T = H + E (Total = Between + Within)
😄 MANOVA Analogy: "MANOVA is like ANOVA but instead of asking 'do these groups have different means on ONE measure?' it asks 'do these groups differ on ANY combination of ALL measures simultaneously?' It's like comparing entire personality profiles rather than just one trait. Much more powerful when variables are correlated!" — And Wilks' Lambda is like the p-value's sophisticated older sibling who considers the whole picture.
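The Hotelling T² formula and its F conversion can be sketched in numpy (simulated data; the true mean shift of 0.5 in the first coordinate, n, p, and seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(10)
n, p = 40, 3
X = rng.normal(loc=[0.5, 0.0, 0.0], size=(n, p))   # true mean != mu0

mu0 = np.zeros(p)                                   # H0: mu = 0
xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)

# Hotelling's T^2 and its exact-F conversion
T2 = n * (xbar - mu0) @ np.linalg.inv(S) @ (xbar - mu0)
F = (n - p) / (p * (n - 1)) * T2                    # ~ F(p, n-p) under H0
print(T2, F)
```

Compare `F` with the upper-α quantile of F(3, 37) to decide; note the multiplier (n−p)/(p(n−1)) is below 1, so F is always smaller than T².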
· · ·
M10
Prediction
Multivariate Multiple Regression
📉
What it is
Multiple Y, Multiple X
Multivariate multiple regression has multiple response variables Y (n×m matrix) AND multiple predictors X (n×(k+1) matrix). Model: Y = XB + E where B (k+1)×m is the coefficient matrix and E is n×m error matrix. Each column of Y is a separate response; they share the same predictors X.
⚙️
Estimation
Matrix OLS
OLS estimator: B̂ = (X'X)⁻¹X'Y. Each column of B̂ is the OLS solution for that response variable separately — so multivariate regression is equivalent to running m separate univariate regressions! However, joint analysis is more efficient and enables tests involving ALL responses simultaneously.
💡
Why Use Jointly?
Advantage of Joint Analysis
Tests on coefficient matrix B involving multiple responses simultaneously
Accounts for correlations among response variables → more powerful tests
Can test hypotheses of form CBM = 0 (general linear hypothesis)
Residual covariance matrix Ê'Ê/(n−k−1) estimates Σ — the cross-response correlations
Model (matrix): Y(n×m) = X(n×(k+1)) · B((k+1)×m) + E(n×m)
OLS estimator: B̂ = (X'X)⁻¹X'Y
Residual matrix: Ê = Y − XB̂ = (I − H)Y where H = X(X'X)⁻¹X'
Error covariance estimate: Σ̂ = Ê'Ê/(n−k−1)
General hypothesis: H₀: CBM = 0 → test via Wilks' Λ or Hotelling trace
🎓 The Big Picture of MVA II
Where Multivariate I asked "how are variables distributed and how do we test hypotheses about means?", Multivariate II asks "what structure is hidden in the data?" PCA finds orthogonal dimensions of maximum variance. Factor Analysis finds latent constructs driving correlations. Cluster Analysis groups similar observations. Discriminant Analysis builds rules to classify new observations. Together, these are the core of unsupervised and supervised multivariate learning. 😄 "MVA II is where statistics starts looking suspiciously like machine learning — because it basically is!"
😄 PCA Analogy: "PCA is like finding the best angle to photograph a 3D sculpture so it reveals the most information in a 2D photo. You rotate your perspective to capture maximum variance in each new direction — the first principal component is the angle with the best overall view, the second adds what the first missed, and so on!" Each photo (PC) is orthogonal to the others.
📊
What it is
Finding Maximum Variance Directions
PCA transforms p correlated variables into p uncorrelated Principal Components (PCs) that are linear combinations of the originals. PC1 captures maximum variance; PC2 captures maximum of remaining variance orthogonal to PC1; and so on. Goal: represent data in fewer dimensions with minimal information loss.
⚙️
How PCA Works
The Eigenvalue Approach
Compute S (or R for standardised PCA)
Find eigenvalues λ₁≥λ₂≥…≥λₚ and eigenvectors e₁,e₂,…,eₚ of S
ith PC: Yᵢ = eᵢ'X (linear combination with eigenvector weights)
Var(Yᵢ) = λᵢ; Cov(Yᵢ,Yⱼ) = 0 for i≠j
Retain k PCs where Σᵢ₌₁ᵏ λᵢ/tr(S) ≥ 0.80 (80% variance rule)
💡
Choosing # of PCs
How Many to Keep?
80% variance rule: Keep enough PCs to explain ≥80% of total variance
Scree plot: Plot λᵢ vs i; look for "elbow" — PCs before the bend
Kaiser criterion: Keep PCs with λᵢ > 1 (from R, not S)
Loadings: lᵢⱼ = eᵢⱼ · √λᵢ (scaled correlation between PC i and variable j)
Communality: hⱼ² = Σᵢ lᵢⱼ² (variance of Xⱼ explained by the retained PCs)
🌍 Real Application: Socioeconomic Index — Bangladesh district data: 8 variables (income, education, health access, sanitation, literacy, employment, poverty rate, infrastructure). PCA extracts PC1 (accounts for 62% variance) which has high positive loadings on income, education, infrastructure and negative loading on poverty — this is a "development index" that can rank districts. Avoids multicollinearity issues in regression by replacing 8 correlated variables with 2-3 orthogonal PCs.
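The eigenvalue recipe above can be sketched end-to-end in numpy (simulated data with two strongly correlated variables plus one noise variable; the construction and seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(11)
z = rng.normal(size=500)
# First two columns share a common signal; third is independent noise
X = np.column_stack([z + 0.1 * rng.normal(size=500),
                     2 * z + 0.1 * rng.normal(size=500),
                     rng.normal(size=500)])

S = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)        # eigh returns ascending order
order = np.argsort(eigvals)[::-1]           # sort descending: lambda1 >= ...
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()         # proportion lambda_k / tr(S)
scores = (X - X.mean(axis=0)) @ eigvecs     # PC scores Y_i = e_i' x
print(explained)
```

On this data PC1 alone clears the 80% variance rule, and the PC scores are exactly uncorrelated — their covariance matrix is diagonal with the eigenvalues on the diagonal.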
· · ·
A2
Signal Separation
Independent Component Analysis (ICA)
🎵
What it is
Beyond Uncorrelated — Finding Independence
ICA decomposes X = AS + noise where S are statistically independent source signals and A is the mixing matrix. Goal: estimate A and recover S. Unlike PCA (which finds uncorrelated components), ICA finds components that are statistically independent — a much stronger condition. Identifiability requires non-Gaussian sources: at most one source may be Gaussian.
🔊
The Cocktail Party Problem
Classic Motivation
Imagine p microphones recording a party with p speakers talking simultaneously. Each microphone records a mixture of all voices. ICA recovers the individual voices (independent sources) from the mixed recordings. Applications: EEG/fMRI brain signal separation, audio source separation, financial return decomposition, image processing.
😄 Factor Analysis Analogy: "Factor analysis is like figuring out that what students score on reading, writing, and comprehension tests is really driven by a single underlying construct: 'verbal intelligence.' You can't directly measure verbal intelligence, but you can observe its effects on multiple tests. Factor analysis extracts these invisible factors that drive the observable correlations." — Widely used in psychology, social science, and education.
🔍
What it is
Latent Factor Model
Factor Analysis (FA) models p observed variables as linear combinations of m << p latent (unobservable) common factors F plus unique factors: X = μ + LF + ε. L is the (p×m) loading matrix; F are m common factors; ε are p unique (specific) factors. Goal: interpret the common factors as meaningful latent constructs.
⚙️
FA vs PCA
Critical Differences
PCA: Explains total variance; components are explicit linear combos of X; descriptive
FA: Explains common variance only (not unique/error variance); factors are latent unobservables; model-based
PCA: Unique solution; components are ordered by variance
FA: Solution not unique — rotation can be applied to improve interpretability!
💡
Factor Rotation
Making Factors Interpretable
Orthogonal rotation (Varimax): Maximises variance of squared loadings per column — produces "simple structure" where each variable loads highly on one factor and near-zero on others. Factors remain uncorrelated.
Oblique rotation (Promax, Oblimin): Allows factors to be correlated — more realistic when latent constructs are related (e.g., verbal and mathematical intelligence are correlated)
⚠️
Conditions
When FA is Appropriate
✅ Variables are correlated (|R|<1) — if uncorrelated, no common factors exist
✅ You believe latent constructs drive the correlations (theory-driven)
✅ Communalities h² should be reasonable — if all h²≈0, model fails
❌ Don't use FA when all variance is unique — use PCA instead
⚠ Factor identification requires subjective interpretation — what does Factor 1 "mean"?
Communality: hⱼ² = Σₖ lⱼₖ² (proportion of Var(Xⱼ) explained by the common factors)
Uniqueness: ψⱼ = 1 − hⱼ² (proportion unexplained by the common factors)
Factor scores (regression method): F̂ = L'Σ⁻¹(X−μ); Bartlett's method: F̂ = (L'Ψ⁻¹L)⁻¹L'Ψ⁻¹(X−μ)
🌍 Bangladesh Application: Poverty Index — 10 district-level variables measured: income, education, sanitation, health access, child mortality, malnutrition, electricity, road access, drinking water quality, school enrolment. FA extracts 3 factors: Factor 1 (high loadings on income, electricity, roads) = "Infrastructure & Economy"; Factor 2 (health access, child mortality, malnutrition) = "Health Status"; Factor 3 (education, school enrolment) = "Human Capital". These factors become inputs to a multidimensional poverty index. Much more interpretable than raw 10-variable data!
· · ·
A4
Unsupervised Grouping
Cluster Analysis — Finding Natural Groups
😄 Clustering Joke: "Cluster analysis is what you do when you have data but no one told you what groups exist. It's like showing up at a party where you know nobody — after a while you notice people naturally cluster by interest, age group, or how loudly they speak. Cluster analysis does this mathematically, without you having to mingle!" The key challenge: you don't know the 'right' answer — there's no objective truth in unsupervised learning.
🗂️
What it is
Grouping Without Labels
Cluster analysis partitions n observations into g groups (clusters) such that observations within a cluster are similar and observations between clusters are dissimilar. It is unsupervised — no predefined groups or labels. Goal: discover natural structure in data.
⚙️
Hierarchical Clustering
Building a Dendrogram
Agglomerative (bottom-up): Start with n clusters (each obs = 1 cluster); merge closest pair; repeat until all in 1 cluster. Most common.
Divisive (top-down): Start with 1 cluster; split recursively
Linkage methods: Single (minimum distance), Complete (maximum), Average (UPGMA), Ward's (minimise within-cluster variance)
Result: Dendrogram — cut at desired level to get g clusters
💡
K-Means Clustering
Iterative Partitioning
Specify k (number of clusters) in advance
Algorithm: (1) Assign each obs to nearest centroid; (2) Update centroids as cluster means; (3) Repeat until convergence
Minimises: Σₖ Σᵢ∈Cₖ ‖xᵢ − μₖ‖² (within-cluster sum of squares)
Sensitive to initial centroids — run multiple times with random starts
Choosing k: Elbow plot, Silhouette coefficient, Gap statistic
⚠️
Conditions & Cautions
When Each Method Works
✅ Hierarchical: Small-medium n; want to see ALL possible groupings; no need to prespecify k
✅ K-means: Large n; approximately spherical clusters; k known or can be estimated
🌍 Bangladesh Health Cluster: 64 districts clustered on 6 health indicators. K-means (k=3 chosen by elbow plot) identifies: Cluster 1 (12 districts, Dhaka-centred) = high healthcare access, low mortality; Cluster 2 (28 districts) = moderate on all indicators; Cluster 3 (24 districts, Char/haor areas) = low access, high child mortality, high malnutrition. This clustering directly informs resource allocation for the Ministry of Health — districts in Cluster 3 receive priority funding.
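The k-means algorithm steps (assign to nearest centroid, recompute centroids, repeat) can be sketched from scratch in numpy (two simulated well-separated blobs as illustrative stand-ins for district indicators; the blob locations and seed are assumptions):

```python
import numpy as np

rng = np.random.default_rng(12)
A = rng.normal([0, 0], 0.5, size=(50, 2))     # blob 1
B = rng.normal([5, 5], 0.5, size=(50, 2))     # blob 2
X = np.vstack([A, B])

def kmeans(X, k, iters=50):
    # Initialise centroids at k random data points (one of several schemes)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Step 1: assign each observation to its nearest centroid
        d = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        # Step 2: update centroids as cluster means (keep old if empty)
        new = []
        for j in range(k):
            pts = X[labels == j]
            new.append(pts.mean(axis=0) if len(pts) else centroids[j])
        centroids = np.array(new)
    return labels, centroids

labels, centroids = kmeans(X, 2)
print(centroids)
```

With well-separated blobs the algorithm recovers the two true groups; in practice one would run it several times with different random starts, since the result depends on initialisation.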
· · ·
A5
Supervised Classification
Discriminant & Classification Analysis
😄 Discriminant Analysis Analogy: "Discriminant analysis is like training a sorting machine. You show it thousands of labelled patients ('has disease' / 'no disease') along with their test results. It learns the pattern of test results that best separates the groups. Then when a new patient arrives with only test results (no diagnosis), the machine classifies them. Fisher's Linear Discriminant is one of the oldest and most elegant classification algorithms — predating neural networks by 80+ years!"
🎯
What it is
Supervised Group Separation
Discriminant Analysis has TWO goals: (1) Description: find linear combinations of variables (discriminant functions) that best separate g known groups; (2) Classification: build a rule to assign future observations to one of the g groups. Unlike cluster analysis: group memberships are KNOWN for the training data.
⚙️
Fisher's LDA
Linear Discriminant Analysis
Find direction w that maximises between-group variance / within-group variance: w = Sₚ⁻¹(x̄₁ − x̄₂) for 2-group case
Classify new x to group 1 if: w'x ≥ midpoint(w'x̄₁, w'x̄₂)
Assumes equal covariance matrices Σ₁=Σ₂ → uses pooled Sₚ
For g>2: up to min(g−1, p) discriminant functions (p = number of variables)
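A minimal Python sketch of the two-group rule above, assuming p = 2 variables so the pooled covariance Sₚ can be inverted with the closed-form 2×2 formula (toy data, illustrative only):

```python
def fisher_lda(X1, X2):
    """Two-group Fisher rule: w = Sp^{-1}(xbar1 - xbar2), classify by midpoint."""
    n1, n2 = len(X1), len(X2)
    m1 = [sum(c) / n1 for c in zip(*X1)]
    m2 = [sum(c) / n2 for c in zip(*X2)]
    # pooled within-group covariance Sp (2x2), divisor n1 + n2 - 2
    Sp = [[0.0, 0.0], [0.0, 0.0]]
    for X, m in ((X1, m1), (X2, m2)):
        for x in X:
            d = [xi - mi for xi, mi in zip(x, m)]
            for i in range(2):
                for j in range(2):
                    Sp[i][j] += d[i] * d[j]
    Sp = [[v / (n1 + n2 - 2) for v in row] for row in Sp]
    det = Sp[0][0] * Sp[1][1] - Sp[0][1] * Sp[1][0]
    inv = [[Sp[1][1] / det, -Sp[0][1] / det],
           [-Sp[1][0] / det, Sp[0][0] / det]]
    diff = [a - b for a, b in zip(m1, m2)]
    w = [sum(inv[i][j] * diff[j] for j in range(2)) for i in range(2)]

    def score(x):
        return sum(wi * xi for wi, xi in zip(w, x))

    mid = 0.5 * (score(m1) + score(m2))   # w'xbar1 >= mid always holds
    def classify(x):
        return 1 if score(x) >= mid else 2
    return classify

classify = fisher_lda([(1, 1), (2, 1), (1, 2), (2, 2)],
                      [(6, 6), (7, 6), (6, 7), (7, 7)])
```

New observations are then assigned by `classify((x1, x2))` — exactly the midpoint rule stated above.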
💡
Probabilistic Classification
Bayes Classification Rules
Linear discriminant rule (LDA): Equal Σ → linear boundary
🌍 Bangladesh Medical ApplicationClassifying TB patients into 3 treatment response groups (rapid/moderate/slow responder) based on 6 baseline clinical variables (age, BMI, sputum grade, haemoglobin, ESR, CD4 count). LDA builds two discriminant functions. Cross-validated APER = 18% (82% correctly classified). The discriminant scores of new patients can be computed from their baseline labs to predict treatment response category — guiding personalised treatment decisions before expensive sensitivity testing is complete.
STAT2201 · Sampling Distribution
x̄
STAT2201 · B.Sc. Statistics Year 2 · BRUR
Sampling Distribution
Sampling Distributions of Mean · Variance · Proportions · CLT · t · F · χ² Distributions · Estimation · Confidence Intervals
🎓 What is a Sampling Distribution?
"If you took your sample 10,000 times and computed the mean each time, what would the distribution of those means look like?" THAT is the sampling distribution — not the distribution of data, but the distribution of a statistic over repeated sampling. 😄 "The sampling distribution is the bridge between data and inference — without it, statistics would just be fancy arithmetic."
Population (N): All items of interest — fixed but usually unobservable
Parameter: μ, σ², π — numerical summaries of the population, FIXED but UNKNOWN
Almost never observe the whole population — too large, costly, or destructive
🔬
Sample & Statistics
What We Actually Observe
Sample (n): n observations drawn from the population
Statistic: x̄, s², p̂ — functions of the sample; RANDOM VARIABLE before sampling
The KEY insight: statistics vary sample to sample — this variation has a pattern = sampling distribution
💡
3 Different Distributions
Never Confuse These!
Population distribution: All individuals — shape could be anything
Sample distribution: Your n observations — approximates population
Sampling distribution: Distribution of the STATISTIC over repeated samples
😄 "Confusing these three is the #1 intro-stats mistake. The CLT applies to the THIRD one!"
⚠️
Standard Error
SE ≠ SD
SD: Variability of individual observations (fixed, doesn't shrink with n)
SE(x̄): Variability of the sample MEAN over repeated samples = σ/√n
SE shrinks as n increases — more data → more precise estimate of μ
Standard Error of x̄SE(x̄) = σ/√n (decreases with n — more data = more precise)
UnbiasednessE(x̄) = μ ; E(s²) = σ² (why we divide by n−1, not n)
· · ·
S2
Key Result
Sampling Distribution of the Mean
📊
Normal Population
Exact Result (Any n)
If X₁,…,Xₙ iid N(μ,σ²), then x̄ ~ N(μ, σ²/n) exactly for any n. Standardise: Z = (x̄−μ)/(σ/√n) ~ N(0,1). When σ unknown, replace with s → T = (x̄−μ)/(s/√n) ~ t(n−1).
💡
Effect of n
More Data = Narrower Distribution
Larger n → smaller SE = σ/√n → sampling distribution narrows around μ
Doubling n reduces SE by √2 ≈ 1.41 (not by 2 — diminishing returns!)
To halve the SE, you must QUADRUPLE n — sampling is expensive!
Let X₁,…,Xₙ be iid with mean μ and finite variance σ². Then as n→∞: √n(x̄−μ)/σ →_d N(0,1) regardless of the population distribution shape. For large n: x̄ ≈ N(μ, σ²/n). This is why the normal distribution appears everywhere!
💡
Why it's Magic
Any Population → Normal x̄
Population can be exponential, uniform, skewed, bimodal — doesn't matter!
n ≥ 30: CLT approximation usually good; n ≥ 50 for very skewed populations
Foundation for: t-tests, z-tests, ANOVA, regression inference, and almost everything
😄 "CLT: Statistics' superhero. No matter what messy distribution you throw at it — average enough and you get normal. Every time."
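The theorem is easy to verify by simulation — here averaging n = 30 draws from a heavily right-skewed Exponential(1) population (mean 1, SD 1; the seed and repetition count are arbitrary choices):

```python
import math
import random

random.seed(42)                      # reproducible run
n, reps = 30, 2000
# sample means of n Exponential(1) draws, repeated many times
means = [sum(random.expovariate(1.0) for _ in range(n)) / n for _ in range(reps)]

m = sum(means) / reps
sd = math.sqrt(sum((x - m) ** 2 for x in means) / (reps - 1))
print(round(m, 3), round(sd, 3))     # close to mu = 1 and sigma/sqrt(n) ~ 0.183
```

Despite the skewed population, a histogram of `means` looks bell-shaped, centred at μ = 1 with spread ≈ σ/√n — exactly the CLT claim.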
⚠️
When CLT Fails
Important Exceptions
Cauchy distribution: no finite mean or variance → CLT doesn't apply (the mean of n Cauchy observations is still Cauchy!)
Very small n with highly skewed data
Dependent observations: standard CLT requires independence
When the population is exactly normal, use the exact t/F/χ² results — no CLT approximation needed
Practical formx̄ ≈ N(μ, σ²/n) for n≥30 approximately
Sum versionSₙ = ΣXᵢ ≈ N(nμ, nσ²) for large n
· · ·
S4
Variance Inference
Chi-Square, t & F Distributions
📐
Chi-Square χ²(k)
Sum of Squared Normals
χ²(k) = Z₁²+…+Zₖ² where Zᵢ iid N(0,1)
Mean=k; Var=2k; Right-skewed; always ≥ 0
Sampling dist of variance: (n−1)s²/σ² ~ χ²(n−1)
⚠ Requires population normality — sensitive to departures!
🍺
t Distribution
Z ÷ √(χ²/ν) — The Guinness Distribution
t(ν) = Z/√(χ²(ν)/ν); heavier tails than N(0,1)
T = (x̄−μ)/(s/√n) ~ t(n−1) when sampling from N(μ,σ²)
As ν→∞: t(ν) → N(0,1)
😄 "Invented by Gosset at Guinness Brewery — published as 'Student' because Guinness prohibited employee publications. Cheers to small samples! 🍺"
💡
F Distribution
Ratio of Two Chi-Squares
F(k₁,k₂) = [χ²(k₁)/k₁] / [χ²(k₂)/k₂]
F = s₁²/s₂² ~ F(n₁−1, n₂−1) under H₀: σ₁²=σ₂² — the variance-ratio test
F = MSA/MSE in ANOVA; t²(ν) = F(1,ν)
Named for Ronald Fisher — inventor of ANOVA, p-values, and experimental design
χ² from sample variance(n−1)s²/σ² ~ χ²(n−1) (population normal)
CI for σ²[(n−1)s²/χ²_{α/2}, (n−1)s²/χ²_{1−α/2}]
Two-sample t (equal σ)T = (x̄₁−x̄₂)/(sₚ√(1/n₁+1/n₂)) ~ t(n₁+n₂−2)
CI for μ (σ unknown)x̄ ± t_{α/2,n−1} · s/√n
Sample size for μn = (z_{α/2} · σ / E)² (E = desired margin of error)
🌍 Bangladesh ExampleA nutritionist samples 40 children aged 5–10 from Rangpur to estimate mean height. Sample mean = 112 cm, s = 8.4 cm. 95% CI: 112 ± t_{0.025,39} × 8.4/√40 = 112 ± 2.023 × 1.33 = [109.3, 114.7] cm. We are 95% confident the true population mean height is between 109.3 and 114.7 cm. To halve the margin of error, we would need n = 4×40 = 160 children — quadrupling the sample!
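The arithmetic of that interval as a quick Python check (the critical value t₀.₀₂₅,₃₉ ≈ 2.023 is hard-coded from t-tables, since the standard library has no t quantile function):

```python
import math

xbar, s, n = 112.0, 8.4, 40
t_crit = 2.023                       # t_{0.025, 39} taken from tables
se = s / math.sqrt(n)                # standard error of the mean
margin = t_crit * se
lo, hi = xbar - margin, xbar + margin
print(round(lo, 1), round(hi, 1))    # 109.3 114.7 — matches the example
```

Note also the cost of precision: halving `margin` requires quadrupling `n`, since SE shrinks only as 1/√n.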
· · ·
S5
Proportions
Sampling Distribution of Proportions & Estimation
📈
Sample Proportion
For Binary Outcomes
p̂ = X/n where X~Binomial(n,p). E(p̂)=p (unbiased); Var(p̂)=p(1−p)/n. By CLT: p̂ ≈ N(p, p(1−p)/n) when np≥10 AND n(1−p)≥10. Standard error: SE(p̂) = √[p(1−p)/n].
💡
CI Interpretation
What 95% CI Really Means
A 95% CI: if we repeated the sampling many times and computed a CI each time, about 95% of those intervals would contain the true μ. It does NOT mean "95% probability μ is in this specific interval" — μ is fixed! 😄 "The CI is a fishing net — 95% of the time it catches the fish (μ). Once cast, the fish is either inside or not."
p̂ approx. dist.p̂ ≈ N(p, p(1−p)/n) for large n (np≥10 AND n(1−p)≥10)
95% CI for pp̂ ± 1.96 · √[p̂(1−p̂)/n] (Wald interval)
Sample size for pn = z²_{α/2} · p(1−p) / E² (use p=0.5 if unknown)
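Both formulas as a Python sketch — the Wald interval, and the conservative sample-size rule with p = 0.5 (the input numbers are illustrative):

```python
import math

def wald_ci(p_hat, n, z=1.96):
    """Wald interval: p_hat +/- z * sqrt(p_hat(1-p_hat)/n)."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

def sample_size_p(E, p=0.5, z=1.96):
    """n = z^2 p(1-p) / E^2, rounded UP so the margin E is guaranteed."""
    return math.ceil(z * z * p * (1 - p) / (E * E))

lo, hi = wald_ci(0.40, 100)
print(round(lo, 3), round(hi, 3))   # 0.304 0.496
print(sample_size_p(0.03))          # 1068
```

The second result is the familiar "about a thousand respondents" rule behind national opinion polls with a ±3% margin.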
STAT2203 · Analysis of Variance & Design of Experiment
🎓 ANOVA in one sentence
ANOVA tests whether means of 3+ groups differ — by comparing BETWEEN-group variance to WITHIN-group variance. Why not just do many t-tests? With g groups you'd need C(g,2) t-tests, inflating Type I error massively. ANOVA controls this with ONE test. 😄 "ANOVA: Statistics' way of comparing all your groups at once, without letting false alarms pile up." Fisher's golden rule of DOE: "Block what you can; randomise what you cannot."
H₀: μ₁=μ₂=…=μg vs H₁: at least one μᵢ differs. Partitions total variation: SST = SSA + SSE. If between-group (MSA) >> within-group (MSE), groups differ. F = MSA/MSE ~ F(g−1, N−g) under H₀.
⚙️
The Logic
Why Variance Tests Means
MSA (between): Measures group-mean differences — large if μᵢ differ
MSE (within): Measures random error — unaffected by group differences
Under H₀: both estimate σ² → F≈1. Under H₁: MSA >> MSE → F >> 1
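The partition SST = SSA + SSE and the F ratio can be computed by hand in a few lines (toy data; a minimal sketch of the one-way calculation):

```python
def one_way_anova(groups):
    """Return (SSA, SSE, F) for a list of groups of observations."""
    N = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / N
    means = [sum(g) / len(g) for g in groups]
    ssa = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    sse = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_a, df_e = len(groups) - 1, N - len(groups)
    f = (ssa / df_a) / (sse / df_e)          # F = MSA / MSE
    return ssa, sse, f

ssa, sse, f = one_way_anova([[1, 2, 3], [2, 3, 4], [6, 7, 8]])
print(ssa, sse, round(f, 1))   # 42.0 6.0 21.0 — compare with an F(2, 6) critical value
```

Here the third group's mean is far from the others, so MSA (21) dwarfs MSE (1) and F is huge — the "between >> within" signature described above.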
💡
Effect Size
η² — How Meaningful Is the Effect?
η² = SSA/SST: proportion of variance explained. Benchmarks: small=0.01, medium=0.06, large=0.14. ALWAYS report — significant F with tiny η² means real but trivially small difference! 😄 "Statistical significance ≠ practical importance."
⚠️
What ANOVA Doesn't Tell
Which Groups Differ?
Significant F only says "at least one mean differs" — need post-hoc tests to find WHICH pairs differ. Never claim the group with highest mean is significantly different without a post-hoc test — that's data dredging!
Bonferroni adjusted αα* = α/m (m = number of comparisons)
· · ·
V3
Two Factors & Interaction
Two-Way ANOVA & Interaction Effect
📐
Two-Way ANOVA
Model & Decomposition
Tests: (1) Main effect of A; (2) Main effect of B; (3) Interaction A×B. SST = SSA + SSB + SSAB + SSE. Interaction is most interesting — does the effect of A depend on the level of B? Plot interaction plots: parallel lines = no interaction; crossing lines = interaction present.
💡
Interaction
It Depends! — The Most Important Result
Significant interaction means the effect of fertiliser on yield DEPENDS on which crop variety is used. Cannot interpret main effects in isolation when interaction is significant. 😄 "Interaction is statistics saying: 'it depends' — and that's almost always the most scientifically interesting answer."
Two-way modelYᵢⱼₖ = μ + αᵢ + βⱼ + (αβ)ᵢⱼ + εᵢⱼₖ
SS decompositionSST = SSA + SSB + SSAB + SSE
F for interactionF_{AB} = MSAB/MSE ~ F((a−1)(b−1), ab(n−1))
· · ·
V4
Experimental Designs
CRD · RBD · LSD & Factorial Designs
🎲
CRD
Completely Randomised Design
Treatments randomly assigned to all units with no restrictions. Simplest design — use when units are homogeneous. Analysis: one-way ANOVA. df_error = N−t. Disadvantage: if units are heterogeneous, MSE will be large and F-test will be weak.
🧱
RBD
Randomised Block Design
Group similar units into blocks; randomise treatments within blocks. Removes block variation from error → smaller MSE → more powerful F-test. Fisher's golden rule: "Block what you can, randomise what you cannot." df_error = (t−1)(b−1). Widely used in agricultural, medical, and industrial experiments.
🔲
LSD
Latin Square Design — Two-Way Blocking
Controls TWO nuisance variables (rows and columns) simultaneously. A t×t square where each treatment appears exactly once in each row and column. df_error = (t−1)(t−2). Assumes additivity — no interactions among rows, columns, and treatments.
🔢
2ᵏ Factorial
k Factors at 2 Levels Each
All 2ᵏ combinations of k factors (each at low/high)
Estimates all main effects AND all interactions
Fractional 2^{k-p}: half/quarter fractions to reduce runs
Yates algorithm computes all effects efficiently
😄 "The 2ᵏ design: maximum information, minimum runs — the statistician's favourite meal."
2² main effect AEffect A = [(y_a+y_ab) − (y_(1)+y_b)] / 2n
🌍 Bangladesh Agricultural TrialTesting 4 fertiliser treatments (t=4) on rice in 3 blocks (b=3) of similar soil fertility. RBD gives df_error=(4−1)(3−1)=6. Result: F=8.4 (p=0.014), η²=0.62 — treatments explain 62% of variance. Tukey post-hoc: Treatment D significantly outperforms A and B (p<0.05) but not C (p=0.12). Blocking removed soil-fertility variability, making the test sensitive enough to detect real treatment differences that a CRD might have missed.
STAT3201 · Hypothesis Testing
H₀
STAT3201 · B.Sc. Statistics Year 3 · BRUR
Hypothesis Testing
Neyman-Pearson Framework · Type I & II Errors · Power · MP & UMP Tests · Likelihood Ratio Tests · p-values · Non-Parametric Tests
🎓 The Court of Statistics
We assume H₀ is true (innocent until proven guilty) and ask: how surprising is our data if H₀ were true? If very surprising (small p-value), we reject H₀. 😄 "H₀ is like a stubborn professor — it won't budge unless the evidence is overwhelming. And even then, there's a chance you made a mistake (Type I error)." Key texts: Casella & Berger for theory; Lehmann & Romano for advanced testing.
H₀: Status quo / no effect — assumed true by default
H₁: What we're trying to demonstrate
Simple: Completely specifies distribution (μ=5)
Composite: Specifies a class (μ>5)
One vs two-sided: H₁: μ>μ₀ vs H₁: μ≠μ₀
⚖️
Error Types
Four Outcomes
✅ H₀ true, Don't reject: Correct (prob 1−α)
❌ H₀ true, Reject: Type I error α — false alarm
❌ H₀ false, Don't reject: Type II error β — missed detection
✅ H₀ false, Reject: Power = 1−β — correct detection
💡
The Tradeoff
α↓ → β↑ for Fixed n
Decreasing α (fewer false alarms) increases β (more misses) for fixed n. Only way to reduce both: increase n. Power = 1−β should be ≥ 0.80 in well-designed studies. 😄 "Demanding 99.9% confidence with n=5 is like demanding perfect night vision in complete darkness — physically impossible with so little data!"
⚠️
Key Asymmetry
H₀ and H₁ Are Not Equal
We control α directly. β depends on α, n, and the true parameter. We can NEVER "prove H₀" — only fail to reject it. "Not guilty ≠ innocent. Fail to reject H₀ ≠ H₀ is true."
Type I error αP(reject H₀ | H₀ true) — false positive; set before the test
Type II error βP(fail to reject H₀ | H₁ true) — false negative; depends on n, δ, σ
Power1 − β = P(reject H₀ | H₁ true) — ability to detect a real effect
Sample size (z-test)n = σ²(z_α + z_β)² / (μ₁−μ₀)²
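The sample-size formula as a Python sketch. The z values are hard-coded from normal tables, and one-sided α = 0.05 with power 0.80 are assumed as common defaults:

```python
import math

def n_for_mean(sigma, delta, z_alpha=1.645, z_beta=0.8416):
    """n = sigma^2 (z_alpha + z_beta)^2 / delta^2, rounded up.
    Defaults: one-sided alpha = 0.05, power = 0.80 (z values from tables)."""
    return math.ceil(sigma ** 2 * (z_alpha + z_beta) ** 2 / delta ** 2)

# detect a 5-point shift on a scale with sigma = 15
print(n_for_mean(15, 5))   # 56
```

This makes the α–β tradeoff concrete: demanding z_α = 2.576 (α = 0.005) instead of 1.645 pushes the required n up sharply for the same power.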
· · ·
H2
Optimal Tests
Neyman-Pearson Lemma · UMP Tests & LRT
🏆
N-P Lemma
Most Powerful Test for Simple H
For H₀:θ=θ₀ vs H₁:θ=θ₁ (both simple), the Most Powerful (MP) test at level α rejects H₀ when Λ(x) = L(θ₁)/L(θ₀) > k. The N-P Lemma derives the optimal rejection region from the likelihood ratio — no guessing needed. For Gaussian data, this recovers the z-test as optimal.
🎯
UMP & MLR
Composite Alternatives
UMP test: Most powerful test for EVERY θ∈H₁ — exists for one-sided hypotheses in exponential families
MLR (Monotone Likelihood Ratio): If L(θ₁)/L(θ₀) is monotone in some statistic T(x), then rejecting for large T gives the UMP test of H₀:θ≤θ₀ vs H₁:θ>θ₀
Normal, Poisson, Binomial — all have MLR in their natural parameter
💡
LRT — General Tests
Wilks' Theorem
LRT: Λ = L(θ̂₀)/L(θ̂) ∈ [0,1]. Wilks (1938): −2 ln Λ → χ²(r) under H₀ where r = number of restrictions. This makes LRT applicable to ANY hypothesis. The chi-square test of independence is a special case. Reject H₀ if −2 ln Λ > χ²_α(r).
N-P MP testReject H₀ if L(θ₁;x)/L(θ₀;x) > k (k: size-α critical value)
p-value = P(T ≥ t_obs | H₀) = probability of data as extreme or more extreme than observed, assuming H₀ true. Small p → data surprising under H₀ → evidence against H₀. It is a continuous measure of evidence, NOT a binary pass/fail.
⚠️
What p is NOT
5 Common Misconceptions
❌ "P(H₀ is true)" — H₀ has no probability in frequentist stats
❌ "Probability results occurred by chance"
❌ "Probability results will replicate"
❌ Measures effect size — huge n can make trivial effects "significant"
✅ "How surprising is my data if H₀ were true?"
💡
Common Tests
Parametric Quick Reference
One-sample z: Z = (x̄−μ₀)/(σ/√n) ~ N(0,1)
One-sample t: T = (x̄−μ₀)/(s/√n) ~ t(n−1)
Paired t: T = d̄/(sD/√n) ~ t(n−1)
χ² GOF: χ² = Σ(O−E)²/E ~ χ²(k−1−p) (p = number of estimated parameters)
χ² independence: ~ χ²((r−1)(c−1))
📊
Non-Parametric Tests
When Assumptions Fail
Wilcoxon signed-rank: Non-parametric one-sample/paired t
Mann-Whitney U: Non-parametric two-sample t (ranks)
Kruskal-Wallis: Non-parametric one-way ANOVA
Spearman's ρ: Non-parametric correlation
⚠ Less powerful than parametric when assumptions hold — use as backup
p-value (two-sided)p = 2·P(T ≥ |t_obs| | H₀)
Decision ruleReject H₀ iff p < α (set α before the test!)
Mann-Whitney UU = n₁n₂ + n₁(n₁+1)/2 − R₁ (R₁ = rank sum of group 1)
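The U statistic from the quick-reference line, computed with midranks for tied values (a minimal sketch; real software also supplies the normal approximation for the p-value):

```python
def mann_whitney_u(g1, g2):
    """U = n1*n2 + n1(n1+1)/2 - R1, using midranks for ties."""
    combined = sorted(g1 + g2)

    def rank(v):
        # midrank = average of the 1-based positions the value occupies
        lo = combined.index(v) + 1
        hi = lo + combined.count(v) - 1
        return (lo + hi) / 2

    n1, n2 = len(g1), len(g2)
    r1 = sum(rank(v) for v in g1)            # rank sum of group 1
    return n1 * n2 + n1 * (n1 + 1) / 2 - r1

print(mann_whitney_u([1, 3, 5], [2, 4, 6]))  # 6.0
```

Because the test uses only ranks, it is unaffected by outliers and skew — the robustness that makes it the backup for the two-sample t.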
😄 The p-hacking Warning"If you torture your data long enough, it will confess to anything." — Ronald Coase. Running 20 tests and reporting only the p<0.05 result guarantees a false positive. Pre-register your hypotheses before seeing the data, report ALL analyses, and always report effect sizes alongside p-values. The replication crisis in psychology was largely caused by widespread p-hacking and selective reporting. Register your analysis plan first — commit before you look!
🎓 Why Sampling?
"You don't need to eat the whole pot of soup to know if it's salty — one spoonful is enough, IF it's well stirred." That's sampling. 😄 The goal: make valid inferences about a population of N units by examining only n << N units, saving time, cost, and resources while maintaining accuracy.
Sampling frame: List of all N population units — must be complete and up-to-date
Sampling unit: The unit selected at each draw
Inclusion probability πᵢ: Probability unit i is selected
Design effect (DEFF): Ratio of actual variance to SRS variance
💡
Probability vs Non-Probability
Two Types of Sampling
Probability: Every unit has known, non-zero inclusion probability → valid inference possible. SRS, stratified, cluster, systematic.
Non-probability: Convenience, purposive, quota — no valid inference to population. Use only for exploratory work.
⚠️
Key Principle
Unbiasedness & Efficiency
Unbiased estimator: E(ȳ) = Ȳ on average
Efficiency: Smaller variance = more information per unit cost
Goal: choose design that minimises variance for given cost
· · ·
T2
Baseline Design
Simple Random Sampling (SRS)
🎲
SRSWOR vs SRSWR
With vs Without Replacement
SRSWOR: Each unit selected at most once — more common; smaller variance
SRSWR: Units can repeat — simpler theory; larger variance
Finite population correction (FPC) = (1−f) = (1−n/N) — matters when n/N > 0.05
⚙️
Estimation
Mean, Total, Proportion
ȳ = (1/n)Σyᵢ — unbiased estimator of Ȳ
ŷ_total = Nȳ — unbiased estimator of total Y
p̂ = x/n — unbiased estimator of proportion P
Var(ȳ) — SRSWORV(ȳ) = (1−f)·S²/n where f=n/N, S²=Σ(yᵢ−Ȳ)²/(N−1)
Estimated Varv(ȳ) = (1−f)·s²/n where s²=Σ(yᵢ−ȳ)²/(n−1)
95% CI for Ȳȳ ± 1.96·√v(ȳ)
Sample size nn = N·z²S² / (N·e² + z²S²) (e=desired margin of error)
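The SRSWOR estimator chain above (mean, FPC-corrected variance, 95% CI) as a Python sketch with toy numbers:

```python
import math

def srs_estimate(y, N, z=1.96):
    """Return (ybar, v_ybar, ci) for an SRSWOR sample y from a population of size N."""
    n = len(y)
    ybar = sum(y) / n
    s2 = sum((v - ybar) ** 2 for v in y) / (n - 1)   # sample variance
    f = n / N                                        # sampling fraction
    v_ybar = (1 - f) * s2 / n                        # FPC-corrected variance
    half = z * math.sqrt(v_ybar)
    return ybar, v_ybar, (ybar - half, ybar + half)

ybar, v, ci = srs_estimate([2, 4, 6, 8, 10], N=100)
print(ybar, round(v, 2))   # 6.0 1.9
```

With f = 0.05 the FPC trims 5% off the variance; when n/N is tiny the (1−f) factor is negligible and the formula collapses to the familiar s²/n.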
· · ·
T3
Improved Efficiency
Stratified Random Sampling
🗂️
What it is
Divide & Sample
Divide population into L non-overlapping strata; take SRS within each stratum. Why? Reduces variance by removing between-stratum variation from the error. Always more efficient than SRS if strata are internally homogeneous.
⚙️
Allocation Methods
How Many from Each Stratum?
Proportional: nₕ = n·(Nₕ/N) — simple; good when σₕ similar
Optimal (Neyman): nₕ ∝ Nₕσₕ — minimises variance for fixed n
Cost-optimal: nₕ ∝ Nₕσₕ/√cₕ — accounts for variable cost per stratum
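The allocation rules differ only in their weights; proportional and Neyman allocation as a sketch (strata sizes and SDs are illustrative, and the rounded nₕ may need a small adjustment to sum exactly to n):

```python
def proportional_alloc(n, N_h):
    """n_h = n * N_h / N — sample shares mirror population shares."""
    N = sum(N_h)
    return [round(n * Nh / N) for Nh in N_h]

def neyman_alloc(n, N_h, sigma_h):
    """n_h proportional to N_h * sigma_h — minimum variance for fixed n."""
    w = [Nh * sh for Nh, sh in zip(N_h, sigma_h)]
    return [round(n * wi / sum(w)) for wi in w]

print(proportional_alloc(100, [400, 600]))        # [40, 60]
print(neyman_alloc(100, [400, 600], [10, 20]))    # [25, 75]
```

Neyman shifts effort towards the large, variable stratum — exactly the HIES logic below of sampling Dhaka more heavily than homogeneous Sylhet.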
💡
When to Stratify
Good Stratification Criteria
Variable highly correlated with study variable Y
Administrative convenience (districts, regions, age groups)
Need separate estimates for subgroups (domains)
Oversampling rare subgroups for adequate representation
🌍 Bangladesh HIES ExampleHousehold Income & Expenditure Survey stratifies by division (8) × urban/rural (2) = 16 strata. Neyman allocation samples more from Dhaka (large, variable) and less from Sylhet (small, homogeneous). Result: 40% lower variance than SRS of same total size — more accurate poverty estimates at lower cost.
· · ·
T4
Practical Designs
Systematic & Cluster Sampling
📋
Systematic Sampling
Every kth Unit
k = N/n (sampling interval); select random start r ∈ {1,…,k}; then r, r+k, r+2k, …
Very easy to implement — just a list and arithmetic
Efficient when list is in random order (≈SRS)
⚠ Periodic pattern in list + periodic k = biased disaster!
🏘️
Cluster Sampling
Sample Groups, Not Individuals
Divide population into clusters; randomly select m clusters; survey ALL units in selected clusters
Cost-efficient when clusters are geographically compact
Less efficient statistically — units within cluster tend to be similar (intraclass correlation ρ)
DEFF = 1 + (b̄−1)ρ where b̄ = avg cluster size
💡
Two-Stage Cluster
Select Clusters, Then Sub-Sample
Stage 1: Select m PSUs (primary sampling units) with probability proportional to size. Stage 2: Select n SSUs within each selected PSU. Used in virtually all large national surveys (DHS, MICS, census post-enumeration). More flexible than single-stage cluster sampling.
Systematic interval kk = N/n (round to integer); sample: r, r+k, r+2k, …
Cluster mean estimatorȳ_cl = (1/m)Σᵢȳᵢ (ȳᵢ = mean of ith selected cluster)
If auxiliary variable X (known population mean X̄) is highly correlated with Y: ȳ_R = R̂·X̄ where R̂=ȳ/x̄. Biased but often much lower MSE than ȳ. Best when the ratio Y/X is more nearly constant than Y itself — e.g., estimating crop yield per hectare.
⚙️
Regression Estimator
OLS-Based Improvement
ȳ_reg = ȳ + b̂(X̄−x̄) where b̂ = Σ(xᵢ−x̄)(yᵢ−ȳ)/Σ(xᵢ−x̄)². Always has smaller or equal variance than ȳ. More general than ratio estimator — doesn't require proportionality. Gain in efficiency ∝ ρ²(X,Y).
💡
When to Use Each
Ratio vs Regression vs SRS
Use ratio when Y∝X (passes through origin) and ρ>0.5
Select PSUs with probability proportional to a size measure (number of households, land area). Larger clusters have higher selection probability. Combined with equal-probability sub-sampling within PSUs → self-weighting sample. Used in almost all national surveys.
⚠️
Non-Sampling Errors
Often Bigger Than Sampling Error!
Coverage error: Frame misses units (undercoverage of homeless, migrants)
Non-response: Selected units don't participate — can cause serious bias
Measurement error: Wrong answers due to question wording, recall, interviewer bias
Processing error: Data entry, coding mistakes
😄 "A perfectly designed sample with 40% non-response is worse than a simple convenience sample for many questions."
💡
Hansen-Hurwitz Estimator
PPS with Replacement
Per-draw selection probability pᵢ = Mᵢ/M₀. Estimator of the population total: Ŷ_HH = (1/n)Σ(yᵢ/pᵢ). Unbiased. Variance ∝ variation of yᵢ/pᵢ — good PPS reduces this variation dramatically compared to SRS for skewed populations (like business surveys).
PPS prob. of selectionπᵢ = n·Mᵢ / M₀ (Mᵢ=size of unit i, M₀=total size)
HH estimatorŶ_HH = (1/n)·Σᵢ(yᵢ/pᵢ), pᵢ = Mᵢ/M₀ per draw (unbiased for the total Y)
Horvitz-Thompsonŷ_HT = Σᵢ∈s (yᵢ/πᵢ) (unbiased for any design)
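A sketch of the Hansen-Hurwitz estimator under PPS with replacement. When y is roughly proportional to size, yᵢ/pᵢ is nearly constant, so the estimate of the total barely varies between samples (toy numbers assumed; pᵢ = Mᵢ/M₀ is the per-draw probability of each sampled unit):

```python
def hansen_hurwitz(sample_y, sample_p):
    """Unbiased estimator of the population TOTAL under PPS with replacement.
    sample_p[i] = per-draw selection probability M_i / M_0 of the drawn unit."""
    n = len(sample_y)
    return sum(y / p for y, p in zip(sample_y, sample_p)) / n

# toy population where y is exactly proportional to size, true total Y = 100;
# the sample happened to draw units with (y, p) = (10, 0.1) and (40, 0.4)
print(hansen_hurwitz([10, 40], [0.1, 0.4]))   # 100.0 — the zero-variance ideal case
```

With SRS, drawing the small unit (y = 10) versus the large one (y = 40) would swing the estimate wildly; PPS weighting cancels that swing.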
🎓 Why Categorical Data Analysis?
Most real-world outcomes are categorical — disease/no disease, vote/don't vote, pass/fail. You cannot use t-tests or ANOVA on counts. CDA provides the correct tools: chi-square tests for independence, odds ratios for effect size, logistic regression for prediction, and log-linear models for multi-way tables. As Agresti notes: "Categorical data analysis is arguably more important in practice than normal-theory methods."
MLE of ππ̂ = y/n (sample proportion — unbiased, consistent)
· · ·
C2
Core Tool
Contingency Tables & χ² Tests
📊
r×c Contingency Table
Cross-Tabulation
An r×c table cross-classifies n observations by two categorical variables (r rows, c columns). Cell count nᵢⱼ = observations in row i, column j. Marginal totals: nᵢ₊ (row), n₊ⱼ (column). Test: are the two variables independent?
⚙️
Pearson χ² Test
Testing Independence
H₀: rows and columns are independent (πᵢⱼ = πᵢ₊·π₊ⱼ)
Expected count: Eᵢⱼ = nᵢ₊·n₊ⱼ/n (under H₀)
χ² = Σ(nᵢⱼ−Eᵢⱼ)²/Eᵢⱼ ~ χ²((r−1)(c−1)) under H₀
⚠ Requires Eᵢⱼ ≥ 5 in all cells — use Fisher's exact if violated
💡
Likelihood Ratio G²
Alternative to χ²
G² = 2Σnᵢⱼ·ln(nᵢⱼ/Eᵢⱼ) ~ χ²((r−1)(c−1)). Also called the deviance. Preferred in log-linear model context — additive across hierarchical models. χ² and G² converge for large n; differ for small n.
⚠️
Fisher's Exact Test
Small Samples
For 2×2 tables with small expected counts: compute exact probability of observing table this extreme, conditioning on both margins fixed. p = C(n₁₊,n₁₁)·C(n₂₊,n₂₁)/C(n,n₊₁). No large-sample approximation needed.
Expected cell countEᵢⱼ = nᵢ₊·n₊ⱼ / n (under independence)
Pearson χ²X² = Σᵢⱼ(nᵢⱼ−Eᵢⱼ)²/Eᵢⱼ ~ χ²((r−1)(c−1))
Likelihood ratio G²G² = 2Σᵢⱼ nᵢⱼ·ln(nᵢⱼ/Eᵢⱼ) ~ χ²((r−1)(c−1))
dfdf = (r−1)(c−1) (for r×c independence test)
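The X², G², and df formulas above, computed for an r×c table in Python (toy 2×2 table; the `if o > 0` guard handles empty cells, whose contribution to G² is zero by convention):

```python
import math

def chi_square_tests(table):
    """Return (X2, G2, df) for an r x c contingency table given as a list of rows."""
    r, c = len(table), len(table[0])
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(r)) for j in range(c)]
    x2 = g2 = 0.0
    for i in range(r):
        for j in range(c):
            e = row_tot[i] * col_tot[j] / n      # expected count under independence
            o = table[i][j]
            x2 += (o - e) ** 2 / e               # Pearson contribution
            if o > 0:
                g2 += 2 * o * math.log(o / e)    # likelihood-ratio contribution
    return x2, g2, (r - 1) * (c - 1)

x2, g2, df = chi_square_tests([[20, 30], [30, 20]])
print(round(x2, 2), round(g2, 2), df)   # 4.0 4.03 1
```

With df = 1 and χ²₀.₀₅(1) = 3.84, both statistics just cross the threshold — and, as noted above, X² and G² agree closely at this sample size.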
· · ·
C3
Effect Size
Measures of Association — OR, RR & φ
📐
Odds Ratio (OR)
Most Important Association Measure
OR = (n₁₁·n₂₂)/(n₁₂·n₂₁) = (odds of outcome in group 1)/(odds in group 2). OR=1 means no association. OR>1 means higher odds in group 1. OR is the natural parameter for logistic regression and case-control studies. Does not depend on marginal totals — unlike RR.
⚙️
Relative Risk (RR)
Risk Ratio for Prospective Studies
RR = (n₁₁/n₁₊) / (n₂₁/n₂₊) = risk in exposed / risk in unexposed
More intuitive than OR when outcomes are common
Only valid when row totals are fixed (prospective/cohort design)
For rare outcomes: OR ≈ RR
💡
φ and Cramér's V
Symmetric Association Measures
φ = √(χ²/n) — for 2×2 tables; ∈ [0,1] (the signed version, (n₁₁n₂₂−n₁₂n₂₁)/√(product of the four margins), lies in [−1,1])
Cramér's V = √(χ²/(n·min(r−1,c−1))) — for r×c; ∈ [0,1]
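All three measures for a 2×2 table [[a, b], [c, d]] — rows = groups, columns = outcome yes/no — as a Python sketch with toy counts:

```python
import math

def association_2x2(a, b, c, d):
    """Odds ratio, relative risk, and signed phi for a 2x2 table [[a, b], [c, d]]."""
    odds_ratio = (a * d) / (b * c)
    rel_risk = (a / (a + b)) / (c / (c + d))     # valid for cohort-style rows
    phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return odds_ratio, rel_risk, phi

or_, rr, phi = association_2x2(30, 70, 10, 90)
print(round(or_, 2), round(rr, 2), round(phi, 3))   # 3.86 3.0 0.25
```

Note OR (3.86) exceeds RR (3.0) here because the outcome is fairly common (30%); for rare outcomes the two converge, as the card above states.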
🌍 Bangladesh TB Study2×2 table: smokers vs non-smokers, TB vs no TB. OR=3.2 (95% CI: 1.8–5.7, p<0.001). Interpretation: smokers have 3.2 times the odds of TB compared to non-smokers. Since TB is rare (<5%), OR ≈ RR: smokers have approximately 3× the risk. This is statistically significant AND clinically meaningful — OR=3.2 is a strong association.
· · ·
C4
Binary Outcomes
Logistic Regression
🔢
The Model
Logit Link Function
For binary Y∈{0,1}: log[π/(1−π)] = β₀ + β₁X₁ + … + βₖXₖ where π = P(Y=1|X). The logit link ensures predicted probabilities ∈ (0,1). Estimated by maximum likelihood (via the Iteratively Reweighted Least Squares, IRLS, algorithm), not OLS.
⚙️
Interpretation
Coefficients as Log-Odds
βⱼ = change in log-odds of Y=1 per unit increase in Xⱼ (others fixed)
exp(βⱼ) = odds ratio for 1-unit increase in Xⱼ — most interpretable
Odds RatioOR_j = exp(β̂ⱼ) — per unit increase in Xⱼ holding others fixed
Wald testz = β̂ⱼ/SE(β̂ⱼ) ~ N(0,1) (test H₀: βⱼ=0)
LR test (nested)G² = −2[ℓ(reduced) − ℓ(full)] ~ χ²(df_diff)
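A minimal Newton-Raphson fit for one binary predictor, showing that exp(β̂₁) reproduces the 2×2 odds ratio. The toy data have success probability 1/4 at x=0 and 3/4 at x=1, so the true OR is (3/1)/(1/3) = 9 (a sketch of the fitting idea, not how production IRLS handles many predictors):

```python
import math

def logistic_fit(x, y, iters=25):
    """Newton-Raphson MLE for logit(pi) = b0 + b1*x (one predictor)."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0   # score vector and information matrix
        for xi, yi in zip(x, y):
            p = 1 / (1 + math.exp(-(b0 + b1 * xi)))
            g0 += yi - p
            g1 += (yi - p) * xi
            w = p * (1 - p)               # the IRLS weight
            h00 += w; h01 += w * xi; h11 += w * xi * xi
        det = h00 * h11 - h01 * h01       # invert the 2x2 information matrix
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

x = [0, 0, 0, 0, 1, 1, 1, 1]
y = [0, 0, 0, 1, 0, 1, 1, 1]
b0, b1 = logistic_fit(x, y)
print(round(math.exp(b1), 2))   # 9.0 — exp(beta1) equals the odds ratio
```

This is exactly the interpretation rule above: β̂₁ is the log-odds difference between x=1 and x=0, so exponentiating recovers the OR.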
· · ·
C5
Multi-Way Tables
Log-Linear Models
📦
What it is
Modelling Cell Counts
Log-linear models treat cell counts as Poisson: ln(μᵢⱼ) = λ + λᵢᴬ + λⱼᴮ + λᵢⱼᴬᴮ. All variables are response variables — no distinction between X and Y. Especially useful for 3+ way tables to model partial and conditional independence structures.
⚙️
Model Hierarchy
Saturated vs Parsimonious
Saturated: All interactions included; perfect fit; df=0 — useless for testing
[AB,AC,BC]: All 2-way interactions; no 3-way
[AB,C]: A and B interact; C independent of both
[A,B,C]: Complete independence of A, B, C
Select model by G² (deviance) and AIC
💡
Link to Logistic Regression
Equivalence Result
For a 2×J table (binary Y), the log-linear model [XY, X] is exactly equivalent to logistic regression of Y on X. The association parameter in the log-linear model = the logistic regression coefficient. This provides a unified framework for all categorical models.
n subjects measured twice (before/after) or matched pairs
Only discordant pairs (b and c) carry information about change
McNemar's test: χ² = (b−c)²/(b+c) ~ χ²(1)
Odds ratio for matched pairs: OR = b/c
💡
Cochran-Mantel-Haenszel
Controlling for Confounding
CMH test: test association between X and Y controlling for a third variable Z (stratification). Combines evidence across K strata. Common OR estimate: OR_MH = Σₖ(aₖdₖ/nₖ) / Σₖ(bₖcₖ/nₖ). Essential for removing confounding in observational studies.
Research Design · Literature Review · Measurement · Questionnaire Design · Validity & Reliability · Data Collection · Report Writing · Ethics
🎓 What is Research Methodology?
Research methodology is the systematic framework for conducting scientific inquiry — it answers "HOW do we find out what we want to know?" It covers study design, measurement, data collection, analysis strategy, and reporting. As Saunders et al. describe it: "Research methodology is the theory of how research should be undertaken." 😄 "Good methodology won't save bad ideas, but bad methodology will ruin good ones."
Research is a systematic, controlled, empirical investigation of natural phenomena guided by theory and hypotheses about the relationship between variables. It is not just "searching the web" — it requires rigour, replicability, and transparency.
⚙️
Types by Purpose
Basic vs Applied vs Action
Basic/Pure: Advances knowledge without immediate application — testing theory
Applied: Solves specific practical problems — policy evaluation, product testing
Action research: Researcher is also a participant; improves practice while studying it
💡
Types by Approach
Quantitative vs Qualitative vs Mixed
Quantitative: Numbers, tests, generalisation — large n, structured data
Qualitative: Meaning, context, depth — interviews, observation, small n
Mixed methods: Combines both — sequential, concurrent, or embedded designs
✅
Types by Time
Cross-Sectional vs Longitudinal
Cross-sectional: One point in time — snapshot; cheap but no causation
Longitudinal: Same subjects over time — tracks change; expensive but causal insight
Retrospective: Past data — case-control; recall bias risk
Prospective: Follow forward — cohort; gold standard for temporal causation
· · ·
R2
Study Design
Research Design & Paradigms
🔭
Research Paradigms
Positivism, Interpretivism & Pragmatism
Positivism: Objective reality exists; can be measured; deductive; quantitative
Interpretivism: Reality is socially constructed; context matters; inductive; qualitative
Most statistics students work within a positivist paradigm
⚙️
Experimental Design
RCT — Gold Standard
Randomised Controlled Trial (RCT): Random assignment to treatment/control → allows causal inference
Quasi-experiment: No randomisation but comparison group exists (DID, RDD)
Observational: No manipulation — correlation only (unless IV, matching used)
💡
Causal Inference
Why RCTs Rule
RCT removes selection bias — treatment and control groups are identical on average (observed AND unobserved). Average Treatment Effect (ATE) = E[Y(1)−Y(0)]. Without randomisation, Y(1) and Y(0) differ systematically — we observe only one potential outcome per person (fundamental problem of causal inference).
🌍 Bangladesh Microfinance RCTBandhan microfinance RCT (Banerjee et al.): randomly assigned microcredit to some villages; compared income/consumption 2 years later. ATE estimate = positive but modest income effect. RCT design means we can confidently attribute this to the credit program — not to pre-existing differences between borrowers and non-borrowers. Landmark example of rigorous impact evaluation.
· · ·
R3
Before Data Collection
Literature Review & Hypothesis Formulation
📚
Literature Review
Why Review the Literature?
Identifies what is already known — avoid duplicating work
Locates gaps your research fills
Provides theoretical framework and conceptual models
Guides appropriate methodology and instruments
Databases: PubMed, Web of Science, Scopus, Google Scholar, JSTOR
🎯
Hypothesis Formulation
Good Hypotheses
Stated as relationship between two or more variables
Testable with available data and methods
Grounded in theory and prior literature
Null H₀: No effect/relationship — what we statistically test
Directional (one-sided): Stronger theory → directional; exploratory → two-sided
💡
PICO Framework
Structuring Research Questions
Especially in health research: Population — Intervention/Exposure — Comparison — Outcome. Example: Among Bangladeshi children under 5 (P), does exclusive breastfeeding for 6 months (I) compared to mixed feeding (C) reduce stunting rates (O)? Clear PICO prevents vague, unanswerable questions.
· · ·
R4
Measurement
Measurement, Scales & Questionnaire Design
📏
Scales of Measurement
Nominal · Ordinal · Interval · Ratio
Nominal: Gender, religion, blood type — categories only; mode appropriate
Ordinal: Education level, satisfaction rating — ranked; median appropriate
Interval: Temperature, IQ — equal intervals, no true zero; mean appropriate
Ratio: Income, weight, height — true zero; all measures appropriate
📝
Questionnaire Design
Golden Rules
Each question measures ONE thing only (no double-barrelled questions)
Use simple, clear language appropriate for target population
Avoid leading questions ("Don't you agree that…?")
Order: easy/non-sensitive first; sensitive/demographics last
Pilot test with 10–20 people before full deployment
💡
Response Scales
Likert, Semantic Differential & VAS
Likert scale: 1–5 or 1–7 agreement scale; treat as ordinal (or approximately interval for ≥5 points)
Semantic differential: Bipolar adjectives (good–bad, fast–slow) on 7-point scale
VAS (Visual Analogue Scale): 0–100mm line; continuous; good for pain, intensity
· · ·
R5
Quality Assurance
Validity, Reliability & Data Quality
🎯
Validity
Are We Measuring What We Intend?
Content validity: Items cover the full domain
Construct validity: Measures the theoretical construct (convergent + discriminant)
Criterion validity: Correlates with gold standard (concurrent + predictive)
Internal validity: Study design allows causal inference (no confounding)
External validity: Results generalise to other populations/settings
🔁
Reliability
Consistency of Measurement
Test-retest reliability: Same result on repeated measurement (Pearson r)
Inter-rater reliability: Different raters agree (Cohen's κ)
Internal consistency: Items in scale hang together (Cronbach's α ≥ 0.7)
💡
Validity vs Reliability
The Dartboard Analogy
Reliable but not valid: all darts in tight cluster but hitting the wrong target. Valid but not reliable: darts scattered but centred on the right target. Reliable AND valid: tight cluster on the correct target. Reliability is necessary but not sufficient for validity.
Methods: Study design, population, sample, instruments, analysis plan
Results: Tables, figures, statistical findings — no interpretation
Discussion: Interpret, compare with literature, limitations, implications
Conclusion: Answer the research question; recommendations
😄 Ethics Reminder"In research ethics, the three golden rules are: (1) Do not harm participants, (2) Do not lie to participants, (3) Do not lie about participants in your results. The fourth, unofficial rule: (4) Do not add authors who made no real contribution — honorary authorship is a form of research misconduct." Always get IRB clearance before data collection, not after — retroactive approval doesn't exist!
📚 Reference Books
[1]
Probability and Statistical Inference
Hogg, R.V., Tanis, E.A., & Zimmerman, D.L. — John Wiley & Sons · 9th Ed.